Dataset Info

This dataset comprises information of 25000 mutual funds in the United states. Various attributes related to the mutual fund have been described and these attributes will be used for making decisions on the rating of the mutual fund by GreatStone which is a top mutual fund rating agency. The following files are provided in the form of CSVs. These files contain various attributes related to the mutual fund. Please find the following files for the same: bond_ratings, fund_allocations, fund_config, fund_ratios, fundspecs, other specs, return_3year, return_5year, return_10year.

Domain:

Mutual Fund - Finance

Objective:

The goal of this hackathon is to predict GreatStone’s rating of a mutual fund. In order to help investors decide on which mutual fund to pick for an investment, the task is to build a model that can predict the rating of a mutual fund. The various attributes that define a mutual fund can be used for building the model

In [4]:
###  Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import pandas_profiling
sns.set(rc={'figure.figsize':(13.7,8.27)}) # setting constant to increase seaborn graph sizes
In [5]:
# Read the data as a data frame
#bond_ratings consists of 12 columns which provide information on the bond rating percentage allocation of the mutual funds
#The tag column is a unique identifier and is also the same as the id.(i.e tag = id)
bond_ratings = pd.read_csv('Hackathon_Files/external/bond_ratings.csv')
pandas_profiling.ProfileReport(bond_ratings)
Out[5]:

Overview

Dataset info

Number of variables 12
Number of observations 25000
Total Missing (%) 11.0%
Total size in memory 2.3 MiB
Average record size in memory 96.0 B

Variables types

Numeric 11
Categorical 0
Boolean 1
Date 0
Text (Unique) 0
Rejected 0
Unsupported 0

Warnings

Variables

a_rating
Numeric

Distinct count 1582
Unique (%) 6.3%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.0544
Minimum 0
Maximum 72.87
Zeros (%) 65.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 8.34
95-th percentile 25.7
Maximum 72.87
Range 72.87
Interquartile range 8.34

Descriptive statistics

Standard deviation 9.2618
Coef of variation 1.8324
Kurtosis 5.8703
Mean 5.0544
MAD 6.8295
Skewness 2.2998
Sum 125780
Variance 85.781
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 16262 65.0%
 
10.5 49 0.2%
 
11.98 44 0.2%
 
10.84 33 0.1%
 
11.48 33 0.1%
 
8.21 28 0.1%
 
5.64 26 0.1%
 
11.56 25 0.1%
 
4.43 24 0.1%
 
12.42 24 0.1%
 
Other values (1571) 8338 33.4%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 16262 65.0%
 
0.01 7 0.0%
 
0.02 10 0.0%
 
0.04 6 0.0%
 
0.05 3 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
59.94 3 0.0%
 
60.42 2 0.0%
 
60.93 7 0.0%
 
66.64 3 0.0%
 
72.87 1 0.0%
 

aa_rating
Numeric

Distinct count 1404
Unique (%) 5.6%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.2091
Minimum -0.19
Maximum 90.22
Zeros (%) 66.0%

Quantile statistics

Minimum -0.19
5-th percentile 0
Q1 0
Median 0
Q3 3.01
95-th percentile 30.282
Maximum 90.22
Range 90.41
Interquartile range 3.01

Descriptive statistics

Standard deviation 11.165
Coef of variation 2.6525
Kurtosis 14.38
Mean 4.2091
MAD 6.0853
Skewness 3.7139
Sum 104750
Variance 124.65
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 16499 66.0%
 
3.94 54 0.2%
 
1.5 53 0.2%
 
3.04 49 0.2%
 
5.05 47 0.2%
 
1.66 42 0.2%
 
3.65 41 0.2%
 
0.6 38 0.2%
 
3.33 38 0.2%
 
4.68 38 0.2%
 
Other values (1393) 7987 31.9%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-0.19 7 0.0%
 
-0.02 3 0.0%
 
-0.01 1 0.0%
 
0.0 16499 66.0%
 
0.01 8 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
80.42 1 0.0%
 
81.36 7 0.0%
 
84.39 1 0.0%
 
85.68 1 0.0%
 
90.22 1 0.0%
 

aaa_rating
Numeric

Distinct count 2008
Unique (%) 8.0%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 14.558
Minimum -3.15
Maximum 118.65
Zeros (%) 63.1%

Quantile statistics

Minimum -3.15
5-th percentile 0
Q1 0
Median 0
Q3 18.955
95-th percentile 72.31
Maximum 118.65
Range 121.8
Interquartile range 18.955

Descriptive statistics

Standard deviation 25.637
Coef of variation 1.761
Kurtosis 1.7023
Mean 14.558
MAD 20.163
Skewness 1.6867
Sum 362300
Variance 657.25
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 15780 63.1%
 
100.0 221 0.9%
 
53.38 34 0.1%
 
73.18 34 0.1%
 
72.2 29 0.1%
 
99.99 26 0.1%
 
1.4 25 0.1%
 
31.83 24 0.1%
 
87.93 24 0.1%
 
12.3 24 0.1%
 
Other values (1997) 8665 34.7%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-3.15 1 0.0%
 
-0.87 3 0.0%
 
-0.66 1 0.0%
 
-0.41 7 0.0%
 
-0.38 3 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
100.0 221 0.9%
 
100.06 5 0.0%
 
100.45 2 0.0%
 
105.51 5 0.0%
 
118.65 1 0.0%
 

b_rating
Numeric

Distinct count 1152
Unique (%) 4.6%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 3.2344
Minimum -0.12
Maximum 80.68
Zeros (%) 70.9%

Quantile statistics

Minimum -0.12
5-th percentile 0
Q1 0
Median 0
Q3 0.71
95-th percentile 21.032
Maximum 80.68
Range 80.8
Interquartile range 0.71

Descriptive statistics

Standard deviation 9.1972
Coef of variation 2.8435
Kurtosis 16.97
Mean 3.2344
MAD 5.012
Skewness 3.9549
Sum 80491
Variance 84.588
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 17727 70.9%
 
0.43 55 0.2%
 
0.32 51 0.2%
 
0.01 38 0.2%
 
0.71 33 0.1%
 
0.7 33 0.1%
 
0.74 30 0.1%
 
0.39 29 0.1%
 
1.2 28 0.1%
 
3.44 28 0.1%
 
Other values (1141) 6834 27.3%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-0.12 7 0.0%
 
0.0 17727 70.9%
 
0.01 38 0.2%
 
0.02 17 0.1%
 
0.03 16 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
68.76 4 0.0%
 
72.35 4 0.0%
 
73.88 6 0.0%
 
77.31 4 0.0%
 
80.68 6 0.0%
 

bb_rating
Numeric

Distinct count 1272
Unique (%) 5.1%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 3.4738
Minimum 0
Maximum 80.47
Zeros (%) 66.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 2.45
95-th percentile 22.01
Maximum 80.47
Range 80.47
Interquartile range 2.45

Descriptive statistics

Standard deviation 8.2997
Coef of variation 2.3892
Kurtosis 12.986
Mean 3.4738
MAD 5.0468
Skewness 3.3951
Sum 86449
Variance 68.886
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 16658 66.6%
 
0.03 79 0.3%
 
0.02 49 0.2%
 
1.69 43 0.2%
 
1.06 41 0.2%
 
4.5 36 0.1%
 
0.83 34 0.1%
 
4.08 31 0.1%
 
3.18 31 0.1%
 
6.51 29 0.1%
 
Other values (1261) 7855 31.4%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 16658 66.6%
 
0.01 2 0.0%
 
0.02 49 0.2%
 
0.03 79 0.3%
 
0.04 19 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
67.44 2 0.0%
 
67.89 3 0.0%
 
69.26 4 0.0%
 
72.46 1 0.0%
 
80.47 1 0.0%
 

bbb_rating
Numeric

Distinct count 1645
Unique (%) 6.6%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 6.1263
Minimum 0
Maximum 98
Zeros (%) 63.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 11.39
95-th percentile 27.44
Maximum 98
Range 98
Interquartile range 11.39

Descriptive statistics

Standard deviation 10.598
Coef of variation 1.7299
Kurtosis 6.5009
Mean 6.1263
MAD 8.0958
Skewness 2.2379
Sum 152460
Variance 112.32
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 15797 63.2%
 
11.64 38 0.2%
 
4.3 36 0.1%
 
14.63 35 0.1%
 
12.96 33 0.1%
 
18.33 33 0.1%
 
18.82 33 0.1%
 
22.1 31 0.1%
 
13.99 30 0.1%
 
12.0 30 0.1%
 
Other values (1634) 8790 35.2%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 15797 63.2%
 
0.01 1 0.0%
 
0.02 3 0.0%
 
0.03 5 0.0%
 
0.05 5 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
71.56 10 0.0%
 
73.5 2 0.0%
 
78.29 6 0.0%
 
84.0 6 0.0%
 
98.0 1 0.0%
 

below_b_rating
Numeric

Distinct count 659
Unique (%) 2.6%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.82752
Minimum -0.02
Maximum 42.3
Zeros (%) 72.6%

Quantile statistics

Minimum -0.02
5-th percentile 0
Q1 0
Median 0
Q3 0.1
95-th percentile 4.72
Maximum 42.3
Range 42.32
Interquartile range 0.1

Descriptive statistics

Standard deviation 2.7
Coef of variation 3.2628
Kurtosis 51.951
Mean 0.82752
MAD 1.2977
Skewness 6.1467
Sum 20594
Variance 7.2901
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 18160 72.6%
 
0.03 102 0.4%
 
0.01 91 0.4%
 
0.1 64 0.3%
 
0.09 61 0.2%
 
0.75 59 0.2%
 
0.17 58 0.2%
 
0.27 57 0.2%
 
0.14 57 0.2%
 
0.2 54 0.2%
 
Other values (648) 6123 24.5%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-0.02 5 0.0%
 
0.0 18160 72.6%
 
0.01 91 0.4%
 
0.02 48 0.2%
 
0.03 102 0.4%
 

Maximum 5 values

Value Count Frequency (%)  
34.3 9 0.0%
 
35.91 4 0.0%
 
38.83 3 0.0%
 
39.0 1 0.0%
 
42.3 3 0.0%
 

duration_bond
Numeric

Distinct count 799
Unique (%) 3.2%
Missing (%) 60.5%
Missing (n) 15126
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.6431
Minimum -3.01
Maximum 25
Zeros (%) 0.0%

Quantile statistics

Minimum -3.01
5-th percentile 0.7165
Q1 3.5
Median 4.8
Q3 5.76
95-th percentile 7.49
Maximum 25
Range 28.01
Interquartile range 2.26

Descriptive statistics

Standard deviation 2.2671
Coef of variation 0.48828
Kurtosis 7.2024
Mean 4.6431
MAD 1.5911
Skewness 1.093
Sum 45846
Variance 5.1398
Memory size 195.4 KiB
Value Count Frequency (%)  
5.39 83 0.3%
 
5.8 75 0.3%
 
3.04 56 0.2%
 
5.57 55 0.2%
 
5.77 53 0.2%
 
4.44 53 0.2%
 
5.6 53 0.2%
 
4.72 50 0.2%
 
5.32 49 0.2%
 
4.84 48 0.2%
 
Other values (788) 9299 37.2%
 
(Missing) 15126 60.5%
 

Minimum 5 values

Value Count Frequency (%)  
-3.01 6 0.0%
 
-2.91 4 0.0%
 
-1.97 6 0.0%
 
-1.61 5 0.0%
 
-1.6 5 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
18.54 2 0.0%
 
21.96 1 0.0%
 
24.04 2 0.0%
 
24.25 1 0.0%
 
25.0 2 0.0%
 

maturity_bond
Numeric

Distinct count 1045
Unique (%) 4.2%
Missing (%) 67.6%
Missing (n) 16907
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.7654
Minimum 0
Maximum 29.3
Zeros (%) 0.1%

Quantile statistics

Minimum 0
5-th percentile 2.212
Q1 5.46
Median 7.29
Q3 8.92
95-th percentile 17.044
Maximum 29.3
Range 29.3
Interquartile range 3.46

Descriptive statistics

Standard deviation 4.1486
Coef of variation 0.53423
Kurtosis 2.7326
Mean 7.7654
MAD 2.8679
Skewness 1.3526
Sum 62846
Variance 17.211
Memory size 195.4 KiB
Value Count Frequency (%)  
7.99 60 0.2%
 
8.33 43 0.2%
 
5.6 42 0.2%
 
8.0 42 0.2%
 
7.17 37 0.1%
 
10.78 37 0.1%
 
7.34 36 0.1%
 
7.05 36 0.1%
 
7.39 35 0.1%
 
5.71 35 0.1%
 
Other values (1034) 7690 30.8%
 
(Missing) 16907 67.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 15 0.1%
 
0.01 5 0.0%
 
0.07 2 0.0%
 
0.12 7 0.0%
 
0.14 7 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
25.51 3 0.0%
 
26.29 1 0.0%
 
27.12 5 0.0%
 
27.79 1 0.0%
 
29.3 2 0.0%
 

others_rating
Numeric

Distinct count 992
Unique (%) 4.0%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.6668
Minimum -68.21
Maximum 100
Zeros (%) 66.4%

Quantile statistics

Minimum -68.21
5-th percentile 0
Q1 0
Median 0
Q3 0.33
95-th percentile 8.0075
Maximum 100
Range 168.21
Interquartile range 0.33

Descriptive statistics

Standard deviation 6.8852
Coef of variation 4.1308
Kurtosis 88.788
Mean 1.6668
MAD 2.7103
Skewness 8.1244
Sum 41479
Variance 47.405
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 16602 66.4%
 
0.09 131 0.5%
 
0.01 131 0.5%
 
0.05 112 0.4%
 
0.1 86 0.3%
 
0.06 79 0.3%
 
0.08 77 0.3%
 
0.16 76 0.3%
 
0.03 73 0.3%
 
1.0 70 0.3%
 
Other values (981) 7449 29.8%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-68.21 1 0.0%
 
-49.55 5 0.0%
 
-19.82 3 0.0%
 
-18.65 1 0.0%
 
-18.08 8 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
83.4 9 0.0%
 
92.81 5 0.0%
 
95.67 8 0.0%
 
99.74 5 0.0%
 
100.0 14 0.1%
 

tag
Numeric

Distinct count 25000
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 139880
Minimum 26000
Maximum 253763
Zeros (%) 0.0%

Quantile statistics

Minimum 26000
5-th percentile 37367
Q1 83022
Median 139880
Q3 196760
95-th percentile 242390
Maximum 253763
Range 227763
Interquartile range 113740

Descriptive statistics

Standard deviation 65731
Coef of variation 0.46992
Kurtosis -1.199
Mean 139880
MAD 56921
Skewness 6.0424e-05
Sum 3496973366
Variance 4320600000
Memory size 195.4 KiB
Value Count Frequency (%)  
165887 1 0.0%
 
193211 1 0.0%
 
86687 1 0.0%
 
174752 1 0.0%
 
41633 1 0.0%
 
144035 1 0.0%
 
232100 1 0.0%
 
98981 1 0.0%
 
39590 1 0.0%
 
201383 1 0.0%
 
Other values (24990) 24990 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
26000 1 0.0%
 
26009 1 0.0%
 
26018 1 0.0%
 
26027 1 0.0%
 
26036 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
253727 1 0.0%
 
253736 1 0.0%
 
253745 1 0.0%
 
253754 1 0.0%
 
253763 1 0.0%
 

us_govt_bond_rating
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.5%
Missing (n) 114
Mean 0
0.0
24886
(Missing)
 
114
Value Count Frequency (%)  
0.0 24886 99.5%
 
(Missing) 114 0.5%
 

Correlations

Sample

bb_rating us_govt_bond_rating below_b_rating others_rating maturity_bond b_rating tag a_rating aaa_rating aa_rating bbb_rating duration_bond
0 0.0 0.0 0.0 0.0 NaN 0.0 67922 0.0 0.0 0.0 0.0 NaN
1 0.0 0.0 0.0 0.0 NaN 0.0 134783 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN 0.0 61271 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN 0.0 64412 0.0 0.0 0.0 0.0 NaN
4 0.0 0.0 0.0 0.0 NaN 0.0 184058 0.0 0.0 0.0 0.0 NaN
In [6]:
#fund_allocations consists of 12 columns which provide information on the sector wise percentage allocation of the mutual funds
fund_allocations = pd.read_csv('Hackathon_Files/external/fund_allocations.csv')
pandas_profiling.ProfileReport(fund_allocations)
Out[6]:

Overview

Dataset info

Number of variables 12
Number of observations 25000
Total Missing (%) 0.4%
Total size in memory 2.3 MiB
Average record size in memory 96.0 B

Variables types

Numeric 12
Categorical 0
Boolean 0
Date 0
Text (Unique) 0
Rejected 0
Unsupported 0

Warnings

Variables

id
Numeric

Distinct count 25000
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 139880
Minimum 26000
Maximum 253763
Zeros (%) 0.0%

Quantile statistics

Minimum 26000
5-th percentile 37367
Q1 83022
Median 139880
Q3 196760
95-th percentile 242390
Maximum 253763
Range 227763
Interquartile range 113740

Descriptive statistics

Standard deviation 65731
Coef of variation 0.46992
Kurtosis -1.199
Mean 139880
MAD 56921
Skewness 6.0424e-05
Sum 3496973366
Variance 4320600000
Memory size 195.4 KiB
Value Count Frequency (%)  
165887 1 0.0%
 
193211 1 0.0%
 
86687 1 0.0%
 
174752 1 0.0%
 
41633 1 0.0%
 
144035 1 0.0%
 
232100 1 0.0%
 
98981 1 0.0%
 
39590 1 0.0%
 
201383 1 0.0%
 
Other values (24990) 24990 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
26000 1 0.0%
 
26009 1 0.0%
 
26018 1 0.0%
 
26027 1 0.0%
 
26036 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
253727 1 0.0%
 
253736 1 0.0%
 
253745 1 0.0%
 
253754 1 0.0%
 
253763 1 0.0%
 

portfolio_communication_allocation
Numeric

Distinct count 982
Unique (%) 3.9%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.2723
Minimum 0
Maximum 100
Zeros (%) 41.1%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1.18
Q3 3.41
95-th percentile 7.08
Maximum 100
Range 100
Interquartile range 3.41

Descriptive statistics

Standard deviation 4.4046
Coef of variation 1.9384
Kurtosis 187.33
Mean 2.2723
MAD 2.2526
Skewness 10.591
Sum 56548
Variance 19.401
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 10266 41.1%
 
3.5 99 0.4%
 
3.49 98 0.4%
 
2.42 94 0.4%
 
2.45 89 0.4%
 
3.41 85 0.3%
 
3.56 78 0.3%
 
2.52 74 0.3%
 
3.0 73 0.3%
 
3.57 71 0.3%
 
Other values (971) 13859 55.4%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 10266 41.1%
 
0.01 71 0.3%
 
0.02 60 0.2%
 
0.03 50 0.2%
 
0.04 14 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
54.88 1 0.0%
 
58.81 4 0.0%
 
80.84 3 0.0%
 
93.31 5 0.0%
 
100.0 12 0.0%
 

portfolio_consumer_defence_allocation
Numeric

Distinct count 1555
Unique (%) 6.2%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.1113
Minimum 0
Maximum 100
Zeros (%) 35.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 4.84
Q3 7.88
95-th percentile 13.77
Maximum 100
Range 100
Interquartile range 7.88

Descriptive statistics

Standard deviation 6.0785
Coef of variation 1.1892
Kurtosis 54.598
Mean 5.1113
MAD 4.2654
Skewness 4.6174
Sum 127200
Variance 36.948
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 8833 35.3%
 
7.76 79 0.3%
 
7.95 77 0.3%
 
7.67 71 0.3%
 
7.56 67 0.3%
 
7.87 67 0.3%
 
6.64 64 0.3%
 
8.45 61 0.2%
 
6.61 60 0.2%
 
7.78 60 0.2%
 
Other values (1544) 15447 61.8%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 8833 35.3%
 
0.01 39 0.2%
 
0.02 4 0.0%
 
0.03 6 0.0%
 
0.04 17 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
94.66 5 0.0%
 
98.05 4 0.0%
 
98.75 1 0.0%
 
99.91 7 0.0%
 
100.0 3 0.0%
 

portfolio_cyclical_consumer_allocation
Numeric

Distinct count 1975
Unique (%) 7.9%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.2116
Minimum 0
Maximum 100
Zeros (%) 30.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 10.46
Q3 13.21
95-th percentile 21.24
Maximum 100
Range 100
Interquartile range 13.21

Descriptive statistics

Standard deviation 9.701
Coef of variation 1.0531
Kurtosis 28.534
Mean 9.2116
MAD 6.6222
Skewness 3.6464
Sum 229240
Variance 94.109
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 7497 30.0%
 
11.84 89 0.4%
 
100.0 75 0.3%
 
11.54 67 0.3%
 
12.89 61 0.2%
 
11.55 55 0.2%
 
11.82 53 0.2%
 
11.73 53 0.2%
 
11.9 51 0.2%
 
11.07 51 0.2%
 
Other values (1964) 16834 67.3%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 7497 30.0%
 
0.01 9 0.0%
 
0.02 4 0.0%
 
0.03 6 0.0%
 
0.04 6 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
93.23 3 0.0%
 
93.38 1 0.0%
 
95.24 1 0.0%
 
99.76 1 0.0%
 
100.0 75 0.3%
 

portfolio_energy_allocation
Numeric

Distinct count 1459
Unique (%) 5.8%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.8266
Minimum 0
Maximum 100
Zeros (%) 35.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 3.38
Q3 6.25
95-th percentile 14.05
Maximum 100
Range 100
Interquartile range 6.25

Descriptive statistics

Standard deviation 13.687
Coef of variation 2.3491
Kurtosis 33.696
Mean 5.8266
MAD 5.7585
Skewness 5.6142
Sum 145000
Variance 187.34
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 8909 35.6%
 
100.0 240 1.0%
 
5.42 89 0.4%
 
5.43 87 0.3%
 
5.4 85 0.3%
 
5.39 67 0.3%
 
5.88 66 0.3%
 
5.27 64 0.3%
 
5.99 61 0.2%
 
6.07 60 0.2%
 
Other values (1448) 15158 60.6%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 8909 35.6%
 
0.01 30 0.1%
 
0.02 11 0.0%
 
0.03 1 0.0%
 
0.04 6 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
99.13 1 0.0%
 
99.3 1 0.0%
 
99.9 3 0.0%
 
99.99 6 0.0%
 
100.0 240 1.0%
 

portfolio_financial_services
Numeric

Distinct count 2287
Unique (%) 9.1%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 11.838
Minimum 0
Maximum 100
Zeros (%) 32.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 13.12
Q3 17.91
95-th percentile 27.49
Maximum 100
Range 100
Interquartile range 17.91

Descriptive statistics

Standard deviation 12.286
Coef of variation 1.0379
Kurtosis 16.103
Mean 11.838
MAD 8.9789
Skewness 2.6968
Sum 294600
Variance 150.96
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 8039 32.2%
 
100.0 107 0.4%
 
13.38 71 0.3%
 
15.63 53 0.2%
 
17.4 50 0.2%
 
16.79 49 0.2%
 
15.39 45 0.2%
 
16.02 44 0.2%
 
17.92 44 0.2%
 
17.56 43 0.2%
 
Other values (2276) 16341 65.4%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 8039 32.2%
 
0.01 3 0.0%
 
0.02 9 0.0%
 
0.03 1 0.0%
 
0.04 3 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
99.48 1 0.0%
 
99.5 1 0.0%
 
99.92 5 0.0%
 
99.95 5 0.0%
 
100.0 107 0.4%
 

portfolio_healthcare_allocation
Numeric

Distinct count 1965
Unique (%) 7.9%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 8.5369
Minimum 0
Maximum 100
Zeros (%) 33.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 9.37
Q3 13.56
95-th percentile 19.8
Maximum 100
Range 100
Interquartile range 13.56

Descriptive statistics

Standard deviation 9.6185
Coef of variation 1.1267
Kurtosis 31.672
Mean 8.5369
MAD 6.8587
Skewness 3.8548
Sum 212450
Variance 92.515
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 8385 33.5%
 
12.12 63 0.3%
 
12.63 60 0.2%
 
14.5 56 0.2%
 
12.42 53 0.2%
 
14.49 53 0.2%
 
11.16 47 0.2%
 
11.17 47 0.2%
 
12.49 45 0.2%
 
16.13 45 0.2%
 
Other values (1954) 16032 64.1%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 8385 33.5%
 
0.01 24 0.1%
 
0.02 12 0.0%
 
0.03 5 0.0%
 
0.04 13 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
99.66 1 0.0%
 
99.71 2 0.0%
 
99.78 1 0.0%
 
99.92 4 0.0%
 
100.0 21 0.1%
 

portfolio_industrials_allocation
Numeric

Distinct count 2037
Unique (%) 8.1%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.056
Minimum 0
Maximum 100
Zeros (%) 29.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 9.57
Q3 12.75
95-th percentile 21.81
Maximum 100
Range 100
Interquartile range 12.75

Descriptive statistics

Standard deviation 10.17
Coef of variation 1.1231
Kurtosis 28.854
Mean 9.056
MAD 6.5424
Skewness 3.9358
Sum 225370
Variance 103.44
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 7368 29.5%
 
10.23 69 0.3%
 
100.0 67 0.3%
 
10.31 64 0.3%
 
11.02 61 0.2%
 
10.18 52 0.2%
 
10.97 50 0.2%
 
10.63 48 0.2%
 
11.09 45 0.2%
 
11.49 45 0.2%
 
Other values (2026) 17017 68.1%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 7368 29.5%
 
0.01 17 0.1%
 
0.02 15 0.1%
 
0.03 15 0.1%
 
0.04 2 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
98.74 1 0.0%
 
99.08 1 0.0%
 
99.42 1 0.0%
 
99.7 1 0.0%
 
100.0 67 0.3%
 

portfolio_materials_basic_allocation
Numeric

Distinct count 1302
Unique (%) 5.2%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 3.8983
Minimum 0
Maximum 100
Zeros (%) 35.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 2.79
Q3 5.06
95-th percentile 10.207
Maximum 100
Range 100
Interquartile range 5.06

Descriptive statistics

Standard deviation 8.1363
Coef of variation 2.0872
Kurtosis 89.961
Mean 3.8983
MAD 3.5053
Skewness 8.4998
Sum 97012
Variance 66.2
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 8753 35.0%
 
4.3 97 0.4%
 
4.46 81 0.3%
 
100.0 77 0.3%
 
4.5 74 0.3%
 
2.46 67 0.3%
 
4.63 64 0.3%
 
5.02 61 0.2%
 
4.47 60 0.2%
 
4.38 60 0.2%
 
Other values (1291) 15492 62.0%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 8753 35.0%
 
0.01 29 0.1%
 
0.02 12 0.0%
 
0.03 11 0.0%
 
0.04 16 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
99.19 5 0.0%
 
99.81 1 0.0%
 
99.92 1 0.0%
 
99.99 5 0.0%
 
100.0 77 0.3%
 

portfolio_property_allocation
Numeric

Distinct count 1403
Unique (%) 5.6%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.9264
Minimum 0
Maximum 100
Zeros (%) 40.8%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1.55
Q3 4.44
95-th percentile 13.58
Maximum 100
Range 100
Interquartile range 4.44

Descriptive statistics

Standard deviation 13.855
Coef of variation 2.8125
Kurtosis 33.725
Mean 4.9264
MAD 5.7902
Skewness 5.6824
Sum 122600
Variance 191.97
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 10196 40.8%
 
2.5 75 0.3%
 
3.73 66 0.3%
 
100.0 65 0.3%
 
2.51 62 0.2%
 
2.44 61 0.2%
 
3.92 61 0.2%
 
1.47 54 0.2%
 
1.79 53 0.2%
 
4.26 51 0.2%
 
Other values (1392) 14142 56.6%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 10196 40.8%
 
0.01 35 0.1%
 
0.02 33 0.1%
 
0.03 26 0.1%
 
0.04 17 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
99.88 1 0.0%
 
99.89 5 0.0%
 
99.91 2 0.0%
 
99.92 5 0.0%
 
100.0 65 0.3%
 

portfolio_tech_allocation
Numeric

Distinct count 2592
Unique (%) 10.4%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 12.78
Minimum 0
Maximum 100
Zeros (%) 31.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 12.885
Q3 19.65
95-th percentile 33.35
Maximum 100
Range 100
Interquartile range 19.65

Descriptive statistics

Standard deviation 12.558
Coef of variation 0.98264
Kurtosis 5.6537
Mean 12.78
MAD 9.8613
Skewness 1.51
Sum 318040
Variance 157.71
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 7901 31.6%
 
19.2 64 0.3%
 
17.67 58 0.2%
 
18.06 47 0.2%
 
19.13 46 0.2%
 
22.94 43 0.2%
 
18.02 42 0.2%
 
16.9 41 0.2%
 
21.92 41 0.2%
 
17.87 41 0.2%
 
Other values (2581) 16562 66.2%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 7901 31.6%
 
0.01 15 0.1%
 
0.02 2 0.0%
 
0.03 1 0.0%
 
0.06 6 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
97.64 4 0.0%
 
97.81 1 0.0%
 
98.54 2 0.0%
 
99.31 2 0.0%
 
100.0 19 0.1%
 

portfolio_utils_allocation
Numeric

Distinct count 1034
Unique (%) 4.1%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7618
Minimum 0
Maximum 100
Zeros (%) 47.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0.43
Q3 3.34
95-th percentile 8.1
Maximum 100
Range 100
Interquartile range 3.34

Descriptive statistics

Standard deviation 7.5944
Coef of variation 2.7498
Kurtosis 87.457
Mean 2.7618
MAD 3.0692
Skewness 8.413
Sum 68730
Variance 57.675
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 11754 47.0%
 
3.34 113 0.5%
 
3.33 91 0.4%
 
3.2 79 0.3%
 
3.21 77 0.3%
 
3.35 75 0.3%
 
2.86 67 0.3%
 
3.15 65 0.3%
 
0.01 62 0.2%
 
4.18 62 0.2%
 
Other values (1023) 12441 49.8%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 11754 47.0%
 
0.01 62 0.2%
 
0.02 40 0.2%
 
0.03 38 0.2%
 
0.04 34 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
97.53 4 0.0%
 
98.59 3 0.0%
 
98.8 2 0.0%
 
99.02 1 0.0%
 
100.0 27 0.1%
 

Correlations

Sample

portfolio_communication_allocation portfolio_financial_services portfolio_industrials_allocation portfolio_tech_allocation portfolio_materials_basic_allocation portfolio_energy_allocation portfolio_consumer_defence_allocation portfolio_healthcare_allocation portfolio_property_allocation id portfolio_utils_allocation portfolio_cyclical_consumer_allocation
0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 67922 0.00 0.00
1 0.78 9.77 9.97 35.51 2.86 0.38 5.88 14.41 2.67 134783 0.39 17.38
2 4.70 16.40 11.45 25.09 8.36 0.00 9.42 16.47 1.03 61271 0.00 7.09
3 6.53 13.80 10.91 0.16 2.22 6.79 25.73 9.00 0.00 64412 19.42 5.43
4 3.49 13.95 10.51 19.26 3.75 5.11 7.29 12.22 10.41 184058 3.07 10.95
In [9]:
#fund_config comprises of 4 columns which comprise the metadata of the mutual funds
fund_config = pd.read_csv('Hackathon_Files/external/fund_config.csv')
#pandas_profiling.ProfileReport(fund_config)
print(fund_config.info())
fund_config.describe().transpose()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 4 columns):
category          25000 non-null object
parent_company    25000 non-null object
fund_id           25000 non-null object
fund_name         25000 non-null object
dtypes: object(4)
memory usage: 781.3+ KB
None
Out[9]:
count unique top freq
category 25000 111 Large Growth 1335
parent_company 25000 761 Fidelity Investments 966
fund_id 25000 25000 e7dff334-3313-4348-917a-64c631da08f1 1
fund_name 25000 24958 Calamos Investment Trust - Calamos Focus Growt... 4
In [10]:
#fund_ratios consists of 8 columns which provides information on various fundamental ratios that describe the mutual funds
fund_ratios = pd.read_csv('Hackathon_Files/external/fund_ratios.csv')
pandas_profiling.ProfileReport(fund_ratios)
Out[10]:

Overview

Dataset info

Number of variables 8
Number of observations 25000
Total Missing (%) 0.3%
Total size in memory 1.5 MiB
Average record size in memory 64.0 B

Variables types

Numeric 3
Categorical 4
Boolean 0
Date 0
Text (Unique) 1
Rejected 0
Unsupported 0

Warnings

  • mmc has a high cardinality: 5689 distinct values Warning
  • pb_ratio has 6059 / 24.2% zeros Zeros
  • pb_ratio is highly skewed (γ1 = 30.129) Skewed
  • pc_ratio has a high cardinality: 1584 distinct values Warning
  • pe_ratio has a high cardinality: 1782 distinct values Warning
  • ps_ratio has a high cardinality: 556 distinct values Warning

Variables

fund_id
Categorical, Unique

First 3 values
e7dff334-3313-4348-917a-64c631da08f1
abf7f06e-6d96-4016-a9c8-2c7975ecf778
0edb76db-aca6-4b0f-8e4e-772674e188fa
Last 3 values
5c653690-cbea-4370-908e-582b0c74cc2d
c97e052e-0f2d-42bb-bacd-f58e116d4c85
819f40d9-f07d-480d-9be8-045999bbb7f5

First 10 values

Value Count Frequency (%)  
0002e898-709a-4b80-8f5c-ec846feff26c 1 0.0%
 
00070160-01a2-4ad3-9290-958a110c8e9f 1 0.0%
 
0009d9da-6735-46c1-81cd-dbc62c53c2e2 1 0.0%
 
000ad9cc-3f7e-48f3-a1f1-4f5c03d3eb6d 1 0.0%
 
000b6091-3c16-41a1-9df4-fce73767dd21 1 0.0%
 

Last 10 values

Value Count Frequency (%)  
fff6de73-cbbd-4814-a59a-f0210d669eae 1 0.0%
 
fff75f2a-1419-4d65-a68f-89d601d47350 1 0.0%
 
fff79179-2ca5-4f26-a023-929c255aeda4 1 0.0%
 
fffb0e0f-2dc9-4e86-b534-476f9669720b 1 0.0%
 
fffe9b65-2288-4d99-844e-89e7747aa323 1 0.0%
 

fund_ratio_net_annual_expense
Numeric

Distinct count 420
Unique (%) 1.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.1217
Minimum 0
Maximum 15.17
Zeros (%) 0.4%

Quantile statistics

Minimum 0
5-th percentile 0.35
Q1 0.72
Median 1.01
Q3 1.44
95-th percentile 2.15
Maximum 15.17
Range 15.17
Interquartile range 0.72

Descriptive statistics

Standard deviation 0.60922
Coef of variation 0.54313
Kurtosis 21.129
Mean 1.1217
MAD 0.45287
Skewness 2.0915
Sum 28042
Variance 0.37114
Memory size 195.4 KiB
Value Count Frequency (%)  
1.0 339 1.4%
 
0.95 334 1.3%
 
0.75 322 1.3%
 
0.9 311 1.2%
 
0.65 301 1.2%
 
0.8 295 1.2%
 
0.85 276 1.1%
 
1.15 265 1.1%
 
0.99 263 1.1%
 
1.25 256 1.0%
 
Other values (410) 22038 88.2%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 89 0.4%
 
0.01 35 0.1%
 
0.02 12 0.0%
 
0.03 34 0.1%
 
0.04 26 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
8.26 1 0.0%
 
8.95 1 0.0%
 
10.39 1 0.0%
 
10.64 1 0.0%
 
15.17 1 0.0%
 

mmc
Categorical

Distinct count 5689
Unique (%) 22.8%
Missing (%) 0.5%
Missing (n) 114
0
6008
828.01
 
75
2,193.13
 
41
Other values (5685)
18762
(Missing)
 
114
Value Count Frequency (%)  
0 6008 24.0%
 
828.01 75 0.3%
 
2,193.13 41 0.2%
 
9,234.14 34 0.1%
 
88,146.69 17 0.1%
 
95,232.43 17 0.1%
 
1,063.09 17 0.1%
 
43,954.74 17 0.1%
 
39,247.34 17 0.1%
 
23,042.48 17 0.1%
 
Other values (5678) 18626 74.5%
 
(Missing) 114 0.5%
 

pb_ratio
Numeric

Distinct count 604
Unique (%) 2.4%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.8543
Minimum 0
Maximum 123.3
Zeros (%) 24.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0.56
Median 1.85
Q3 2.38
95-th percentile 4.5
Maximum 123.3
Range 123.3
Interquartile range 1.82

Descriptive statistics

Standard deviation 2.9842
Coef of variation 1.6094
Kurtosis 1211.6
Mean 1.8543
MAD 1.1158
Skewness 30.129
Sum 46145
Variance 8.9057
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 6059 24.2%
 
2.0 235 0.9%
 
2.01 218 0.9%
 
1.94 181 0.7%
 
2.13 180 0.7%
 
1.96 173 0.7%
 
2.03 172 0.7%
 
1.92 170 0.7%
 
2.02 167 0.7%
 
1.99 158 0.6%
 
Other values (593) 17173 68.7%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 6059 24.2%
 
0.12 2 0.0%
 
0.26 7 0.0%
 
0.27 6 0.0%
 
0.29 5 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
10.77 1 0.0%
 
11.17 2 0.0%
 
14.07 4 0.0%
 
22.47 17 0.1%
 
123.3 11 0.0%
 

pc_ratio
Categorical

Distinct count 1584
Unique (%) 6.3%
Missing (%) 0.5%
Missing (n) 114
0
4144
0.0
 
1900
6.99
 
98
Other values (1580)
18744
(Missing)
 
114
Value Count Frequency (%)  
0 4144 16.6%
 
0.0 1900 7.6%
 
6.99 98 0.4%
 
7.18 98 0.4%
 
0.46 92 0.4%
 
7.21 81 0.3%
 
7.63 78 0.3%
 
7.54 77 0.3%
 
7.23 76 0.3%
 
7.96 76 0.3%
 
Other values (1573) 18166 72.7%
 
(Missing) 114 0.5%
 

pe_ratio
Categorical

Distinct count 1782
Unique (%) 7.1%
Missing (%) 0.5%
Missing (n) 114
0
4128
0.0
 
1910
3.65
 
92
Other values (1778)
18756
(Missing)
 
114
Value Count Frequency (%)  
0 4128 16.5%
 
0.0 1910 7.6%
 
3.65 92 0.4%
 
15.14 89 0.4%
 
15.37 87 0.3%
 
15.87 86 0.3%
 
17.05 69 0.3%
 
16.15 67 0.3%
 
16.57 66 0.3%
 
15.09 66 0.3%
 
Other values (1771) 18226 72.9%
 
(Missing) 114 0.5%
 

ps_ratio
Categorical

Distinct count 556
Unique (%) 2.2%
Missing (%) 0.5%
Missing (n) 114
0.0
3959
0
 
2026
1.49
 
273
Other values (552)
18628
Value Count Frequency (%)  
0.0 3959 15.8%
 
0 2026 8.1%
 
1.49 273 1.1%
 
1.47 252 1.0%
 
1.51 249 1.0%
 
1.45 238 1.0%
 
1.5 221 0.9%
 
0.99 193 0.8%
 
1.31 183 0.7%
 
1.54 180 0.7%
 
Other values (545) 17112 68.4%
 

tag
Numeric

Distinct count 25000
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 139880
Minimum 26000
Maximum 253763
Zeros (%) 0.0%

Quantile statistics

Minimum 26000
5-th percentile 37367
Q1 83022
Median 139880
Q3 196760
95-th percentile 242390
Maximum 253763
Range 227763
Interquartile range 113740

Descriptive statistics

Standard deviation 65731
Coef of variation 0.46992
Kurtosis -1.199
Mean 139880
MAD 56921
Skewness 6.0424e-05
Sum 3496973366
Variance 4320600000
Memory size 195.4 KiB
Value Count Frequency (%)  
165887 1 0.0%
 
193211 1 0.0%
 
86687 1 0.0%
 
174752 1 0.0%
 
41633 1 0.0%
 
144035 1 0.0%
 
232100 1 0.0%
 
98981 1 0.0%
 
39590 1 0.0%
 
201383 1 0.0%
 
Other values (24990) 24990 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
26000 1 0.0%
 
26009 1 0.0%
 
26018 1 0.0%
 
26027 1 0.0%
 
26036 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
253727 1 0.0%
 
253736 1 0.0%
 
253745 1 0.0%
 
253754 1 0.0%
 
253763 1 0.0%
 

Correlations

Sample

fund_id tag fund_ratio_net_annual_expense pb_ratio ps_ratio mmc pc_ratio pe_ratio
0 264614c6-5ac3-4146-ba26-1674b136cb40 67922 1.44 1.71 1.31 19,857.41 5.91 14.51
1 f5ad58c2-fdea-4087-8678-e04744f89f90 134783 0.58 5.30 3.38 72,347.03 15.95 18.88
2 3c13f4ab-02c4-4ca7-a133-7e996ec5d0c4 61271 0.99 5.40 3.67 68,857.43 15.97 23.27
3 ff78bdd8-59eb-4cef-9f3c-b1baacce9554 64412 0.52 2.23 1.63 43,266.62 8.93 12.7
4 63d8406d-c525-494a-8e03-d4fc4cfcb571 184058 0.75 2.02 1.4 43,747.9 7.59 14.74
In [11]:
#fund_specs contains 9 columns which give information about the specifications of the mutual funds
fund_specs = pd.read_csv('Hackathon_Files/external/fund_specs.csv')
pandas_profiling.ProfileReport(fund_specs)
Out[11]:

Overview

Dataset info

Number of variables 9
Number of observations 25000
Total Missing (%) 3.7%
Total size in memory 1.7 MiB
Average record size in memory 72.0 B

Variables types

Numeric 5
Categorical 3
Boolean 0
Date 0
Text (Unique) 0
Rejected 1
Unsupported 0

Warnings

Variables

currency
Constant

This variable is constant and should be ignored for analysis

Constant value USD

fund_size
Categorical

Distinct count 4
Unique (%) 0.0%
Missing (%) 5.9%
Missing (n) 1480
Large
14173
Medium
6009
Small
3338
(Missing)
 
1480
Value Count Frequency (%)  
Large 14173 56.7%
 
Medium 6009 24.0%
 
Small 3338 13.4%
 
(Missing) 1480 5.9%
 

greatstone_rating
Numeric

Distinct count 7
Unique (%) 0.0%
Missing (%) 20.0%
Missing (n) 5000
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.8397
Minimum 0
Maximum 5
Zeros (%) 5.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 2
Median 3
Q3 4
95-th percentile 5
Maximum 5
Range 5
Interquartile range 2

Descriptive statistics

Standard deviation 1.2774
Coef of variation 0.44984
Kurtosis -0.17408
Mean 2.8397
MAD 0.99599
Skewness -0.448
Sum 56795
Variance 1.6319
Memory size 195.4 KiB
Value Count Frequency (%)  
3.0 6786 27.1%
 
4.0 4614 18.5%
 
2.0 4230 16.9%
 
5.0 1629 6.5%
 
1.0 1376 5.5%
 
0.0 1365 5.5%
 
(Missing) 5000 20.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 1365 5.5%
 
1.0 1376 5.5%
 
2.0 4230 16.9%
 
3.0 6786 27.1%
 
4.0 4614 18.5%
 

Maximum 5 values

Value Count Frequency (%)  
1.0 1376 5.5%
 
2.0 4230 16.9%
 
3.0 6786 27.1%
 
4.0 4614 18.5%
 
5.0 1629 6.5%
 

inception_date
Categorical

Distinct count 4383
Unique (%) 17.5%
Missing (%) 0.0%
Missing (n) 0
2015-06-29
 
118
2017-12-28
 
115
2014-03-31
 
110
Other values (4380)
24657
Value Count Frequency (%)  
2015-06-29 118 0.5%
 
2017-12-28 115 0.5%
 
2014-03-31 110 0.4%
 
2007-09-27 104 0.4%
 
2001-02-28 102 0.4%
 
2012-11-07 102 0.4%
 
2014-12-30 97 0.4%
 
2015-11-29 95 0.4%
 
2009-07-05 95 0.4%
 
2005-03-31 92 0.4%
 
Other values (4373) 23970 95.9%
 

investment_class
Categorical

Distinct count 4
Unique (%) 0.0%
Missing (%) 5.9%
Missing (n) 1480
Blend
10298
Growth
6671
Value
6551
(Missing)
 
1480
Value Count Frequency (%)  
Blend 10298 41.2%
 
Growth 6671 26.7%
 
Value 6551 26.2%
 
(Missing) 1480 5.9%
 

return_ytd
Numeric

Distinct count 2752
Unique (%) 11.0%
Missing (%) 0.4%
Missing (n) 108
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.2889
Minimum -36.3
Maximum 46.29
Zeros (%) 0.1%

Quantile statistics

Minimum -36.3
5-th percentile 1.37
Q1 4.43
Median 9.82
Q3 13.08
95-th percentile 18.32
Maximum 46.29
Range 82.59
Interquartile range 8.65

Descriptive statistics

Standard deviation 5.801
Coef of variation 0.62451
Kurtosis 2.2438
Mean 9.2889
MAD 4.6308
Skewness -0.11919
Sum 231220
Variance 33.652
Memory size 195.4 KiB
Value Count Frequency (%)  
2.45 36 0.1%
 
11.88 36 0.1%
 
11.33 34 0.1%
 
10.94 34 0.1%
 
11.76 33 0.1%
 
2.76 32 0.1%
 
2.62 32 0.1%
 
10.27 31 0.1%
 
3.4 31 0.1%
 
11.21 31 0.1%
 
Other values (2741) 24562 98.2%
 
(Missing) 108 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-36.3 1 0.0%
 
-36.14 1 0.0%
 
-27.8 1 0.0%
 
-27.79 1 0.0%
 
-27.7 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
38.96 1 0.0%
 
41.33 1 0.0%
 
45.78 1 0.0%
 
45.88 1 0.0%
 
46.29 1 0.0%
 

tag
Numeric

Distinct count 25000
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 139880
Minimum 26000
Maximum 253763
Zeros (%) 0.0%

Quantile statistics

Minimum 26000
5-th percentile 37367
Q1 83022
Median 139880
Q3 196760
95-th percentile 242390
Maximum 253763
Range 227763
Interquartile range 113740

Descriptive statistics

Standard deviation 65731
Coef of variation 0.46992
Kurtosis -1.199
Mean 139880
MAD 56921
Skewness 6.0424e-05
Sum 3496973366
Variance 4320600000
Memory size 195.4 KiB
Value Count Frequency (%)  
165887 1 0.0%
 
193211 1 0.0%
 
86687 1 0.0%
 
174752 1 0.0%
 
41633 1 0.0%
 
144035 1 0.0%
 
232100 1 0.0%
 
98981 1 0.0%
 
39590 1 0.0%
 
201383 1 0.0%
 
Other values (24990) 24990 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
26000 1 0.0%
 
26009 1 0.0%
 
26018 1 0.0%
 
26027 1 0.0%
 
26036 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
253727 1 0.0%
 
253736 1 0.0%
 
253745 1 0.0%
 
253754 1 0.0%
 
253763 1 0.0%
 

total_assets
Numeric

Distinct count 6014
Unique (%) 24.1%
Missing (%) 0.5%
Missing (n) 119
Infinite (%) 0.0%
Infinite (n) 0
Mean 3476500000
Minimum 19160
Maximum 772720000000
Zeros (%) 0.0%

Quantile statistics

Minimum 19160
5-th percentile 10240000
Q1 93030000
Median 441790000
Q3 1620000000
95-th percentile 11840000000
Maximum 772720000000
Range 772720000000
Interquartile range 1527000000

Descriptive statistics

Standard deviation 18275000000
Coef of variation 5.2568
Kurtosis 732.19
Mean 3476500000
MAD 4840100000
Skewness 21.584
Sum 86498000000000
Variance 3.3398e+2
Memory size 195.4 KiB
Value Count Frequency (%)  
1480000000.0 65 0.3%
 
1030000000.0 63 0.3%
 
1100000000.0 62 0.2%
 
1230000000.0 58 0.2%
 
1370000000.0 58 0.2%
 
1160000000.0 56 0.2%
 
1620000000.0 56 0.2%
 
1260000000.0 55 0.2%
 
1020000000.0 52 0.2%
 
1080000000.0 52 0.2%
 
Other values (6003) 24304 97.2%
 
(Missing) 119 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
19160.0 1 0.0%
 
24840.0 1 0.0%
 
51310.0 1 0.0%
 
73820.0 1 0.0%
 
77330.0 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
215930000000.0 5 0.0%
 
224720000000.0 2 0.0%
 
369860000000.0 5 0.0%
 
459650000000.0 3 0.0%
 
772720000000.0 5 0.0%
 

yield
Numeric

Distinct count 891
Unique (%) 3.6%
Missing (%) 0.5%
Missing (n) 127
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.8504
Minimum 0
Maximum 45.36
Zeros (%) 16.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0.48
Median 1.65
Q3 2.64
95-th percentile 4.98
Maximum 45.36
Range 45.36
Interquartile range 2.16

Descriptive statistics

Standard deviation 1.8043
Coef of variation 0.97507
Kurtosis 45.583
Mean 1.8504
MAD 1.271
Skewness 3.6622
Sum 46026
Variance 3.2555
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 4134 16.5%
 
2.05 88 0.4%
 
1.96 83 0.3%
 
1.95 77 0.3%
 
2.15 76 0.3%
 
2.35 76 0.3%
 
1.87 76 0.3%
 
2.06 76 0.3%
 
1.3 75 0.3%
 
2.03 74 0.3%
 
Other values (880) 20038 80.2%
 
(Missing) 127 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 4134 16.5%
 
0.0068 1 0.0%
 
0.01 44 0.2%
 
0.02 61 0.2%
 
0.03 44 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
29.94 1 0.0%
 
30.16 1 0.0%
 
38.75 1 0.0%
 
38.77 1 0.0%
 
45.36 1 0.0%
 

Correlations

Sample

investment_class currency total_assets yield greatstone_rating inception_date tag fund_size return_ytd
0 Value USD 1.185000e+07 5.57 NaN 2015-02-02 67922 Large 20.19
1 Growth USD 1.397000e+10 0.42 3.0 2012-05-30 134783 Large 16.79
2 Growth USD 2.660000e+09 0.02 4.0 1987-08-23 61271 Large 17.13
3 Value USD 1.957000e+10 2.71 3.0 2005-10-24 64412 Large 11.63
4 Blend USD 2.847000e+07 2.44 0.0 2016-12-12 184058 Large 10.25
In [12]:
#other_specs contains 43 columns which give information of the other aspects of the mutual funds
other_specs = pd.read_csv('Hackathon_Files/external/other_specs.csv')
pandas_profiling.ProfileReport(other_specs)
Out[12]:

Overview

Dataset info

Number of variables 43
Number of observations 25000
Total Missing (%) 10.9%
Total size in memory 8.2 MiB
Average record size in memory 344.0 B

Variables types

Numeric 35
Categorical 4
Boolean 0
Date 0
Text (Unique) 0
Rejected 4
Unsupported 0

Warnings

Variables

1_month_fund_return
Numeric

Distinct count 1266
Unique (%) 5.1%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.96168
Minimum -13.63
Maximum 15.29
Zeros (%) 0.5%

Quantile statistics

Minimum -13.63
5-th percentile -2.03
Q1 0.35
Median 1.1
Q3 1.72
95-th percentile 3.35
Maximum 15.29
Range 28.92
Interquartile range 1.37

Descriptive statistics

Standard deviation 1.6943
Coef of variation 1.7619
Kurtosis 7.2433
Mean 0.96168
MAD 1.1087
Skewness 0.0010391
Sum 23932
Variance 2.8708
Memory size 195.4 KiB
Value Count Frequency (%)  
1.17 156 0.6%
 
1.18 154 0.6%
 
1.15 142 0.6%
 
1.28 138 0.6%
 
1.23 133 0.5%
 
0.0 131 0.5%
 
1.12 130 0.5%
 
1.39 129 0.5%
 
1.32 128 0.5%
 
1.27 128 0.5%
 
Other values (1255) 23516 94.1%
 

Minimum 5 values

Value Count Frequency (%)  
-13.63 1 0.0%
 
-10.34 1 0.0%
 
-10.22 1 0.0%
 
-9.53 1 0.0%
 
-9.48 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
14.26 1 0.0%
 
15.11 1 0.0%
 
15.17 1 0.0%
 
15.27 1 0.0%
 
15.29 1 0.0%
 

1_year_return_fund
Numeric

Distinct count 3653
Unique (%) 14.6%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.6062
Minimum -37.09
Maximum 59.19
Zeros (%) 0.1%

Quantile statistics

Minimum -37.09
5-th percentile -9.416
Q1 -0.06
Median 3.1
Q3 5.12
95-th percentile 13.43
Maximum 59.19
Range 96.28
Interquartile range 5.18

Descriptive statistics

Standard deviation 6.6941
Coef of variation 2.5685
Kurtosis 2.9963
Mean 2.6062
MAD 4.5776
Skewness -0.0084128
Sum 64855
Variance 44.81
Memory size 195.4 KiB
Value Count Frequency (%)  
3.45 49 0.2%
 
3.34 49 0.2%
 
2.32 47 0.2%
 
3.46 47 0.2%
 
2.92 46 0.2%
 
4.35 46 0.2%
 
2.75 46 0.2%
 
3.39 45 0.2%
 
2.83 45 0.2%
 
4.27 45 0.2%
 
Other values (3642) 24420 97.7%
 
(Missing) 115 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-37.09 1 0.0%
 
-36.54 1 0.0%
 
-35.73 1 0.0%
 
-35.47 1 0.0%
 
-35.2 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
47.08 1 0.0%
 
47.11 1 0.0%
 
50.43 1 0.0%
 
52.43 1 0.0%
 
59.19 1 0.0%
 

2010_return_category
Numeric

Distinct count 103
Unique (%) 0.4%
Missing (%) 46.2%
Missing (n) 11538
Infinite (%) 0.0%
Infinite (n) 0
Mean 13.155
Minimum -28.95
Maximum 41.56
Zeros (%) 0.0%

Quantile statistics

Minimum -28.95
5-th percentile 1.65
Q1 8.6
Median 13.66
Q3 15.53
95-th percentile 26.17
Maximum 41.56
Range 70.51
Interquartile range 6.93

Descriptive statistics

Standard deviation 7.6595
Coef of variation 0.58224
Kurtosis 2.5943
Mean 13.155
MAD 5.5914
Skewness -0.21867
Sum 177090
Variance 58.667
Memory size 195.4 KiB
Value Count Frequency (%)  
15.53 850 3.4%
 
14.01 844 3.4%
 
13.66 701 2.8%
 
7.72 599 2.4%
 
11.83 457 1.8%
 
13.74 412 1.6%
 
26.98 409 1.6%
 
10.24 394 1.6%
 
25.61 387 1.5%
 
24.61 371 1.5%
 
Other values (92) 8038 32.2%
 
(Missing) 11538 46.2%
 

Minimum 5 values

Value Count Frequency (%)  
-28.95 8 0.0%
 
-28.7 4 0.0%
 
-24.28 50 0.2%
 
-15.61 13 0.1%
 
-2.0 45 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
27.08 141 0.6%
 
27.35 27 0.1%
 
29.99 22 0.1%
 
30.88 9 0.0%
 
41.56 50 0.2%
 

2010_return_fund
Numeric

Distinct count 3359
Unique (%) 13.4%
Missing (%) 49.0%
Missing (n) 12262
Infinite (%) 0.0%
Infinite (n) 0
Mean 13.603
Minimum -51.55
Maximum 54.5
Zeros (%) 0.0%

Quantile statistics

Minimum -51.55
5-th percentile 1.46
Q1 8.21
Median 13.07
Q3 18.15
95-th percentile 28.501
Maximum 54.5
Range 106.05
Interquartile range 9.94

Descriptive statistics

Standard deviation 8.9666
Coef of variation 0.65915
Kurtosis 4.028
Mean 13.603
MAD 6.5522
Skewness -0.16529
Sum 173280
Variance 80.4
Memory size 195.4 KiB
Value Count Frequency (%)  
10.62 18 0.1%
 
14.56 17 0.1%
 
14.02 17 0.1%
 
12.46 16 0.1%
 
13.34 15 0.1%
 
11.22 15 0.1%
 
13.51 15 0.1%
 
11.41 15 0.1%
 
15.86 15 0.1%
 
14.36 15 0.1%
 
Other values (3348) 12580 50.3%
 
(Missing) 12262 49.0%
 

Minimum 5 values

Value Count Frequency (%)  
-51.55 1 0.0%
 
-51.19 1 0.0%
 
-50.13 1 0.0%
 
-50.11 1 0.0%
 
-49.86 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
53.32 1 0.0%
 
53.33 1 0.0%
 
53.9 1 0.0%
 
53.96 1 0.0%
 
54.5 1 0.0%
 

2011_return_category
Numeric

Distinct count 102
Unique (%) 0.4%
Missing (%) 42.1%
Missing (n) 10533
Infinite (%) 0.0%
Infinite (n) 0
Mean -1.8647
Minimum -35.5
Maximum 32.9
Zeros (%) 0.0%

Quantile statistics

Minimum -35.5
5-th percentile -14.72
Q1 -4.07
Median -2.06
Q3 2.01
95-th percentile 10.18
Maximum 32.9
Range 68.4
Interquartile range 6.08

Descriptive statistics

Standard deviation 7.192
Coef of variation -3.8568
Kurtosis 1.2128
Mean -1.8647
MAD 5.2441
Skewness -0.44908
Sum -26977
Variance 51.724
Memory size 195.4 KiB
Value Count Frequency (%)  
-2.46 892 3.6%
 
-1.27 866 3.5%
 
-0.75 736 2.9%
 
-3.96 635 2.5%
 
5.86 630 2.5%
 
-13.97 499 2.0%
 
-0.11 477 1.9%
 
-7.93 438 1.8%
 
-3.55 432 1.7%
 
-4.07 403 1.6%
 
Other values (91) 8459 33.8%
 
(Missing) 10533 42.1%
 

Minimum 5 values

Value Count Frequency (%)  
-35.5 13 0.1%
 
-24.95 41 0.2%
 
-22.64 11 0.0%
 
-21.45 13 0.1%
 
-20.95 41 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
10.64 119 0.5%
 
10.93 132 0.5%
 
11.47 12 0.0%
 
11.74 70 0.3%
 
32.9 16 0.1%
 

2011_return_fund
Numeric

Distinct count 3475
Unique (%) 13.9%
Missing (%) 44.7%
Missing (n) 11163
Infinite (%) 0.0%
Infinite (n) 0
Mean -1.3365
Minimum -43.78
Maximum 55.81
Zeros (%) 0.1%

Quantile statistics

Minimum -43.78
5-th percentile -16.78
Q1 -5.33
Median -0.52
Q3 4.04
95-th percentile 10.572
Maximum 55.81
Range 99.59
Interquartile range 9.37

Descriptive statistics

Standard deviation 8.4071
Coef of variation -6.2903
Kurtosis 1.8999
Mean -1.3365
MAD 6.3388
Skewness -0.55184
Sum -18494
Variance 70.68
Memory size 195.4 KiB
Value Count Frequency (%)  
-2.65 16 0.1%
 
-2.28 16 0.1%
 
1.13 16 0.1%
 
-2.59 16 0.1%
 
2.03 15 0.1%
 
1.16 15 0.1%
 
1.73 15 0.1%
 
1.32 15 0.1%
 
-0.79 15 0.1%
 
1.12 15 0.1%
 
Other values (3464) 13683 54.7%
 
(Missing) 11163 44.7%
 

Minimum 5 values

Value Count Frequency (%)  
-43.78 1 0.0%
 
-43.25 1 0.0%
 
-42.71 1 0.0%
 
-42.53 1 0.0%
 
-41.56 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
43.21 1 0.0%
 
44.49 1 0.0%
 
55.65 1 0.0%
 
55.79 1 0.0%
 
55.81 1 0.0%
 

2012_fund_return
Numeric

Distinct count 3026
Unique (%) 12.1%
Missing (%) 39.9%
Missing (n) 9985
Infinite (%) 0.0%
Infinite (n) 0
Mean 12.898
Minimum -43.9
Maximum 81.66
Zeros (%) 0.0%

Quantile statistics

Minimum -43.9
5-th percentile 2.347
Q1 8.89
Median 13.49
Q3 16.84
95-th percentile 22.839
Maximum 81.66
Range 125.56
Interquartile range 7.95

Descriptive statistics

Standard deviation 7.1249
Coef of variation 0.5524
Kurtosis 7.1155
Mean 12.898
MAD 5.1394
Skewness -0.49862
Sum 193670
Variance 50.765
Memory size 195.4 KiB
Value Count Frequency (%)  
15.63 24 0.1%
 
13.33 22 0.1%
 
15.71 21 0.1%
 
16.35 21 0.1%
 
14.71 20 0.1%
 
13.98 20 0.1%
 
15.25 19 0.1%
 
12.38 19 0.1%
 
16.54 18 0.1%
 
16.05 18 0.1%
 
Other values (3015) 14813 59.3%
 
(Missing) 9985 39.9%
 

Minimum 5 values

Value Count Frequency (%)  
-43.9 1 0.0%
 
-43.36 1 0.0%
 
-37.78 1 0.0%
 
-37.19 1 0.0%
 
-35.64 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
54.75 1 0.0%
 
64.18 1 0.0%
 
65.82 1 0.0%
 
79.82 1 0.0%
 
81.66 1 0.0%
 

2012_return_category
Numeric

Distinct count 103
Unique (%) 0.4%
Missing (%) 36.5%
Missing (n) 9124
Infinite (%) 0.0%
Infinite (n) 0
Mean 12.411
Minimum -23.7
Maximum 31.78
Zeros (%) 0.0%

Quantile statistics

Minimum -23.7
5-th percentile 2.8
Q1 9.01
Median 14.57
Q3 15.46
95-th percentile 18.29
Maximum 31.78
Range 55.48
Interquartile range 6.45

Descriptive statistics

Standard deviation 5.8154
Coef of variation 0.46857
Kurtosis 5.7132
Mean 12.411
MAD 4.3754
Skewness -1.1426
Sum 197040
Variance 33.819
Memory size 195.4 KiB
Value Count Frequency (%)  
15.34 957 3.8%
 
14.96 922 3.7%
 
14.57 793 3.2%
 
7.01 672 2.7%
 
11.72 535 2.1%
 
15.84 520 2.1%
 
18.29 459 1.8%
 
13.15 459 1.8%
 
14.67 431 1.7%
 
15.46 428 1.7%
 
Other values (92) 9700 38.8%
 
(Missing) 9124 36.5%
 

Minimum 5 values

Value Count Frequency (%)  
-23.7 50 0.2%
 
-19.55 8 0.0%
 
-10.52 13 0.1%
 
-9.2 50 0.2%
 
-7.39 51 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
22.64 27 0.1%
 
23.62 44 0.2%
 
24.77 62 0.2%
 
29.69 13 0.1%
 
31.78 129 0.5%
 

2013_category_return
Highly correlated

This variable is highly correlated with 2013_return_fund and should be ignored for analysis

Correlation 0.9414

2013_return_fund
Numeric

Distinct count 5513
Unique (%) 22.1%
Missing (%) 34.2%
Missing (n) 8538
Infinite (%) 0.0%
Infinite (n) 0
Mean 17.149
Minimum -67.62
Maximum 116.38
Zeros (%) 0.0%

Quantile statistics

Minimum -67.62
5-th percentile -5.7395
Q1 0.84
Median 18.59
Q3 31.69
95-th percentile 41.399
Maximum 116.38
Range 184
Interquartile range 30.85

Descriptive statistics

Standard deviation 17.117
Coef of variation 0.99814
Kurtosis 0.23209
Mean 17.149
MAD 14.6
Skewness -0.12006
Sum 282300
Variance 292.99
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 14 0.1%
 
0.46 14 0.1%
 
-1.78 14 0.1%
 
-2.5 13 0.1%
 
-2.2 13 0.1%
 
-0.17 12 0.0%
 
-2.26 12 0.0%
 
-1.94 12 0.0%
 
-2.03 12 0.0%
 
19.44 12 0.0%
 
Other values (5502) 16334 65.3%
 
(Missing) 8538 34.2%
 

Minimum 5 values

Value Count Frequency (%)  
-67.62 1 0.0%
 
-67.28 1 0.0%
 
-54.0 1 0.0%
 
-53.55 1 0.0%
 
-53.39 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
93.71 1 0.0%
 
108.7 1 0.0%
 
110.85 1 0.0%
 
114.22 1 0.0%
 
116.38 1 0.0%
 

2014_category_return
Numeric

Distinct count 109
Unique (%) 0.4%
Missing (%) 24.7%
Missing (n) 6183
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.677
Minimum -17.98
Maximum 44.59
Zeros (%) 0.0%

Quantile statistics

Minimum -17.98
5-th percentile -4.98
Q1 1.54
Median 5.04
Q3 9.31
95-th percentile 10.96
Maximum 44.59
Range 62.57
Interquartile range 7.77

Descriptive statistics

Standard deviation 6.2251
Coef of variation 1.331
Kurtosis 4.3396
Mean 4.677
MAD 4.3946
Skewness 0.18587
Sum 88008
Variance 38.752
Memory size 195.4 KiB
Value Count Frequency (%)  
10.0 1078 4.3%
 
10.96 1024 4.1%
 
10.21 891 3.6%
 
5.18 736 2.9%
 
2.79 622 2.5%
 
6.21 577 2.3%
 
-3.01 547 2.2%
 
1.11 516 2.1%
 
2.44 514 2.1%
 
-4.98 504 2.0%
 
Other values (98) 11808 47.2%
 
(Missing) 6183 24.7%
 

Minimum 5 values

Value Count Frequency (%)  
-17.98 82 0.3%
 
-17.48 52 0.2%
 
-17.23 14 0.1%
 
-16.65 58 0.2%
 
-14.46 15 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
21.7 17 0.1%
 
27.25 93 0.4%
 
28.03 184 0.7%
 
33.36 7 0.0%
 
44.59 14 0.1%
 

2014_return_fund
Numeric

Distinct count 3316
Unique (%) 13.3%
Missing (%) 28.8%
Missing (n) 7206
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.0969
Minimum -42.4
Maximum 63.8
Zeros (%) 0.0%

Quantile statistics

Minimum -42.4
5-th percentile -6.6835
Q1 1.42
Median 5.2
Q3 9.2
95-th percentile 14.693
Maximum 63.8
Range 106.2
Interquartile range 7.78

Descriptive statistics

Standard deviation 7.4266
Coef of variation 1.4571
Kurtosis 4.8801
Mean 5.0969
MAD 5.1713
Skewness 0.019534
Sum 90694
Variance 55.154
Memory size 195.4 KiB
Value Count Frequency (%)  
5.58 29 0.1%
 
5.88 26 0.1%
 
5.54 26 0.1%
 
4.12 26 0.1%
 
5.99 25 0.1%
 
5.66 24 0.1%
 
4.86 24 0.1%
 
6.01 23 0.1%
 
5.81 23 0.1%
 
5.53 23 0.1%
 
Other values (3305) 17545 70.2%
 
(Missing) 7206 28.8%
 

Minimum 5 values

Value Count Frequency (%)  
-42.4 1 0.0%
 
-40.84 1 0.0%
 
-40.55 1 0.0%
 
-40.43 1 0.0%
 
-40.42 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
48.22 1 0.0%
 
52.86 1 0.0%
 
54.48 1 0.0%
 
63.71 1 0.0%
 
63.8 1 0.0%
 

2015_return_fund
Numeric

Distinct count 3070
Unique (%) 12.3%
Missing (%) 22.8%
Missing (n) 5688
Infinite (%) 0.0%
Infinite (n) 0
Mean -1.9572
Minimum -62.11
Maximum 86.62
Zeros (%) 0.2%

Quantile statistics

Minimum -62.11
5-th percentile -13.24
Q1 -3.81
Median -1.16
Q3 1.04
95-th percentile 5.99
Maximum 86.62
Range 148.73
Interquartile range 4.85

Descriptive statistics

Standard deviation 6.3592
Coef of variation -3.249
Kurtosis 12.834
Mean -1.9572
MAD 4.0475
Skewness -1.6307
Sum -37798
Variance 40.439
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 41 0.2%
 
-1.5 35 0.1%
 
-0.87 35 0.1%
 
-1.12 35 0.1%
 
-0.04 34 0.1%
 
-1.66 34 0.1%
 
-1.17 32 0.1%
 
-1.42 32 0.1%
 
0.32 31 0.1%
 
0.3 31 0.1%
 
Other values (3059) 18972 75.9%
 
(Missing) 5688 22.8%
 

Minimum 5 values

Value Count Frequency (%)  
-62.11 1 0.0%
 
-61.76 1 0.0%
 
-56.95 1 0.0%
 
-56.49 2 0.0%
 
-56.43 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
22.61 1 0.0%
 
29.51 1 0.0%
 
30.81 1 0.0%
 
85.18 1 0.0%
 
86.62 1 0.0%
 

2016_return_category
Numeric

Distinct count 102
Unique (%) 0.4%
Missing (%) 12.4%
Missing (n) 3097
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.2853
Minimum -21.11
Maximum 54.81
Zeros (%) 0.6%

Quantile statistics

Minimum -21.11
5-th percentile -0.25
Q1 3.23
Median 6.23
Q3 10.37
95-th percentile 20.78
Maximum 54.81
Range 75.92
Interquartile range 7.14

Descriptive statistics

Standard deviation 6.7941
Coef of variation 0.93259
Kurtosis 6.7273
Mean 7.2853
MAD 4.7939
Skewness 1.4074
Sum 159570
Variance 46.16
Memory size 195.4 KiB
Value Count Frequency (%)  
3.23 2075 8.3%
 
10.37 1137 4.5%
 
14.81 1021 4.1%
 
5.54 739 3.0%
 
8.47 657 2.6%
 
7.34 651 2.6%
 
20.78 627 2.5%
 
0.79 608 2.4%
 
13.3 589 2.4%
 
11.2 576 2.3%
 
Other values (91) 13223 52.9%
 
(Missing) 3097 12.4%
 

Minimum 5 values

Value Count Frequency (%)  
-21.11 52 0.2%
 
-10.6 98 0.4%
 
-2.98 118 0.5%
 
-2.75 97 0.4%
 
-2.14 366 1.5%
 

Maximum 5 values

Value Count Frequency (%)  
26.69 88 0.4%
 
27.3 90 0.4%
 
29.22 70 0.3%
 
32.05 12 0.0%
 
54.81 53 0.2%
 

2016_return_fund
Numeric

Distinct count 3729
Unique (%) 14.9%
Missing (%) 15.7%
Missing (n) 3931
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.2878
Minimum -62.92
Maximum 80.1
Zeros (%) 0.1%

Quantile statistics

Minimum -62.92
5-th percentile -2.22
Q1 2.07
Median 6.32
Q3 10.85
95-th percentile 21.656
Maximum 80.1
Range 143.02
Interquartile range 8.78

Descriptive statistics

Standard deviation 8.1711
Coef of variation 1.1212
Kurtosis 7.1091
Mean 7.2878
MAD 5.7766
Skewness 1.1104
Sum 153550
Variance 66.767
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.29 27 0.1%
 
-0.01 27 0.1%
 
6.4 26 0.1%
 
6.52 26 0.1%
 
5.34 25 0.1%
 
6.45 24 0.1%
 
6.47 24 0.1%
 
7.01 24 0.1%
 
6.55 23 0.1%
 
7.26 23 0.1%
 
Other values (3718) 20820 83.3%
 
(Missing) 3931 15.7%
 

Minimum 5 values

Value Count Frequency (%)  
-62.92 1 0.0%
 
-62.54 1 0.0%
 
-51.74 1 0.0%
 
-51.22 1 0.0%
 
-40.7 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
73.02 1 0.0%
 
75.08 1 0.0%
 
75.97 1 0.0%
 
78.45 1 0.0%
 
80.1 1 0.0%
 

2017_category_return
Numeric

Distinct count 103
Unique (%) 0.4%
Missing (%) 5.7%
Missing (n) 1428
Infinite (%) 0.0%
Infinite (n) 0
Mean 14.848
Minimum -27.04
Maximum 46.78
Zeros (%) 0.0%

Quantile statistics

Minimum -27.04
5-th percentile 1.73
Q1 6.25
Median 14.67
Q3 21.5
95-th percentile 31.58
Maximum 46.78
Range 73.82
Interquartile range 15.25

Descriptive statistics

Standard deviation 9.6485
Coef of variation 0.64982
Kurtosis -0.023909
Mean 14.848
MAD 7.9925
Skewness 0.16038
Sum 350000
Variance 93.094
Memory size 195.4 KiB
Value Count Frequency (%)  
27.67 1285 5.1%
 
20.44 1214 4.9%
 
15.94 1088 4.4%
 
3.71 915 3.7%
 
23.61 807 3.2%
 
34.17 719 2.9%
 
13.21 676 2.7%
 
12.28 672 2.7%
 
25.12 641 2.6%
 
6.47 635 2.5%
 
Other values (92) 14920 59.7%
 
(Missing) 1428 5.7%
 

Minimum 5 values

Value Count Frequency (%)  
-27.04 53 0.2%
 
-5.78 99 0.4%
 
-4.84 71 0.3%
 
0.56 92 0.4%
 
0.77 26 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
35.35 152 0.6%
 
36.19 131 0.5%
 
37.39 71 0.3%
 
42.4 51 0.2%
 
46.78 15 0.1%
 

2017_return_fund
Highly correlated

This variable is highly correlated with 2017_category_return and should be ignored for analysis

Correlation 0.91277

2018_return_category
Numeric

Distinct count 100
Unique (%) 0.4%
Missing (%) 3.2%
Missing (n) 809
Infinite (%) 0.0%
Infinite (n) 0
Mean -6.4862
Minimum -27.27
Maximum 7.19
Zeros (%) 0.0%

Quantile statistics

Minimum -27.27
5-th percentile -16.07
Q1 -9.27
Median -6.25
Q3 -2.09
95-th percentile 0.92
Maximum 7.19
Range 34.46
Interquartile range 7.18

Descriptive statistics

Standard deviation 5.4202
Coef of variation -0.83565
Kurtosis -0.16797
Mean -6.4862
MAD 4.3311
Skewness -0.51425
Sum -156910
Variance 29.379
Memory size 195.4 KiB
Value Count Frequency (%)  
-5.76 1330 5.3%
 
-2.09 1299 5.2%
 
-6.27 1236 4.9%
 
-8.53 1097 4.4%
 
-0.5 943 3.8%
 
-9.64 837 3.3%
 
-16.07 739 3.0%
 
-12.72 676 2.7%
 
-14.59 667 2.7%
 
-2.59 648 2.6%
 
Other values (89) 14719 58.9%
 
(Missing) 809 3.2%
 

Minimum 5 values

Value Count Frequency (%)  
-27.27 71 0.3%
 
-20.68 52 0.2%
 
-19.13 148 0.6%
 
-19.01 92 0.4%
 
-18.34 135 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
1.77 39 0.2%
 
1.91 51 0.2%
 
2.11 177 0.7%
 
2.76 50 0.2%
 
7.19 53 0.2%
 

2018_return_fund
Numeric

Distinct count 3132
Unique (%) 12.5%
Missing (%) 3.8%
Missing (n) 940
Infinite (%) 0.0%
Infinite (n) 0
Mean -6.6868
Minimum -59.1
Maximum 39.47
Zeros (%) 0.1%

Quantile statistics

Minimum -59.1
5-th percentile -18.71
Q1 -10.52
Median -5.795
Q3 -1.62
95-th percentile 1.52
Maximum 39.47
Range 98.57
Interquartile range 8.9

Descriptive statistics

Standard deviation 6.6815
Coef of variation -0.99922
Kurtosis 1.4985
Mean -6.6868
MAD 5.2661
Skewness -0.6087
Sum -160880
Variance 44.643
Memory size 195.4 KiB
Value Count Frequency (%)  
-4.95 28 0.1%
 
0.31 28 0.1%
 
0.53 28 0.1%
 
0.14 28 0.1%
 
-5.17 27 0.1%
 
-7.65 27 0.1%
 
-4.7 27 0.1%
 
0.64 27 0.1%
 
0.63 27 0.1%
 
0.85 26 0.1%
 
Other values (3121) 23787 95.1%
 
(Missing) 940 3.8%
 

Minimum 5 values

Value Count Frequency (%)  
-59.1 1 0.0%
 
-58.6 1 0.0%
 
-48.0 1 0.0%
 
-47.81 1 0.0%
 
-47.73 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
23.6 2 0.0%
 
28.39 1 0.0%
 
29.63 1 0.0%
 
37.94 1 0.0%
 
39.47 1 0.0%
 

3_months_return_category
Highly correlated

This variable is highly correlated with ytd_return_category and should be ignored for analysis

Correlation 1

bond_percentage_of_porfolio
Numeric

Distinct count 2767
Unique (%) 11.1%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 30.782
Minimum 0
Maximum 100
Zeros (%) 46.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 2.17
Q3 64.06
95-th percentile 98.44
Maximum 100
Range 100
Interquartile range 64.06

Descriptive statistics

Standard deviation 38.687
Coef of variation 1.2568
Kurtosis -1.0773
Mean 30.782
MAD 34.38
Skewness 0.78545
Sum 766040
Variance 1496.7
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 11583 46.3%
 
100.0 216 0.9%
 
99.99 49 0.2%
 
0.01 37 0.1%
 
11.63 35 0.1%
 
0.08 35 0.1%
 
0.14 29 0.1%
 
94.46 27 0.1%
 
99.98 25 0.1%
 
94.99 25 0.1%
 
Other values (2756) 12825 51.3%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 11583 46.3%
 
0.01 37 0.1%
 
0.02 18 0.1%
 
0.03 20 0.1%
 
0.04 21 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
99.95 6 0.0%
 
99.97 25 0.1%
 
99.98 25 0.1%
 
99.99 49 0.2%
 
100.0 216 0.9%
 

cash_percent_of_portfolio
Numeric

Distinct count 2083
Unique (%) 8.3%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.3818
Minimum 0
Maximum 100
Zeros (%) 5.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 1.24
Median 3.14
Q3 7.04
95-th percentile 32.21
Maximum 100
Range 100
Interquartile range 5.8

Descriptive statistics

Standard deviation 12.9
Coef of variation 1.7475
Kurtosis 18.684
Mean 7.3818
MAD 7.4007
Skewness 3.8862
Sum 183700
Variance 166.4
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 1253 5.0%
 
0.01 124 0.5%
 
1.64 81 0.3%
 
1.4 75 0.3%
 
1.62 74 0.3%
 
0.88 74 0.3%
 
100.0 71 0.3%
 
0.99 70 0.3%
 
3.15 69 0.3%
 
1.59 67 0.3%
 
Other values (2072) 22928 91.7%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 1253 5.0%
 
0.01 124 0.5%
 
0.02 55 0.2%
 
0.03 43 0.2%
 
0.04 14 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
99.04 4 0.0%
 
99.43 1 0.0%
 
99.52 4 0.0%
 
99.91 2 0.0%
 
100.0 71 0.3%
 

category_ratio_net_annual_expense
Numeric

Distinct count 74
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.0135
Minimum 0.39
Maximum 2.6
Zeros (%) 0.0%

Quantile statistics

Minimum 0.39
5-th percentile 0.45
Q1 0.81
Median 1.02
Q3 1.18
95-th percentile 1.57
Maximum 2.6
Range 2.21
Interquartile range 0.37

Descriptive statistics

Standard deviation 0.329
Coef of variation 0.32461
Kurtosis 2.2873
Mean 1.0135
MAD 0.2355
Skewness 0.78106
Sum 25338
Variance 0.10824
Memory size 195.4 KiB
Value Count Frequency (%)  
1.06 1971 7.9%
 
0.94 1515 6.1%
 
0.45 1351 5.4%
 
1.11 1141 4.6%
 
1.01 1133 4.5%
 
1.0 1126 4.5%
 
0.76 960 3.8%
 
1.36 894 3.6%
 
0.75 864 3.5%
 
0.82 793 3.2%
 
Other values (64) 13252 53.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.39 199 0.8%
 
0.43 135 0.5%
 
0.44 240 1.0%
 
0.45 1351 5.4%
 
0.46 249 1.0%
 

Maximum 5 values

Value Count Frequency (%)  
2.08 8 0.0%
 
2.17 392 1.6%
 
2.18 53 0.2%
 
2.28 4 0.0%
 
2.6 15 0.1%
 

category_return_1month
Numeric

Distinct count 88
Unique (%) 0.4%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.87087
Minimum -3.49
Maximum 9.68
Zeros (%) 0.8%

Quantile statistics

Minimum -3.49
5-th percentile -1.21
Q1 0.46
Median 1.11
Q3 1.38
95-th percentile 2.12
Maximum 9.68
Range 13.17
Interquartile range 0.92

Descriptive statistics

Standard deviation 1.2061
Coef of variation 1.3849
Kurtosis 4.37
Mean 0.87087
MAD 0.79881
Skewness -0.76316
Sum 21671
Variance 1.4547
Memory size 195.4 KiB
Value Count Frequency (%)  
1.16 1442 5.8%
 
2.12 1333 5.3%
 
1.29 1270 5.1%
 
0.7 1192 4.8%
 
1.7 1171 4.7%
 
0.46 1121 4.5%
 
1.11 1103 4.4%
 
1.14 757 3.0%
 
-2.31 683 2.7%
 
0.8 666 2.7%
 
Other values (77) 14147 56.6%
 

Minimum 5 values

Value Count Frequency (%)  
-3.49 74 0.3%
 
-3.24 15 0.1%
 
-3.2 418 1.7%
 
-2.55 2 0.0%
 
-2.31 683 2.7%
 

Maximum 5 values

Value Count Frequency (%)  
3.67 202 0.8%
 
3.82 54 0.2%
 
4.2 106 0.4%
 
5.19 24 0.1%
 
9.68 15 0.1%
 

category_return_1year
Numeric

Distinct count 100
Unique (%) 0.4%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7365
Minimum -10.92
Maximum 17.48
Zeros (%) 0.8%

Quantile statistics

Minimum -10.92
5-th percentile -7.87
Q1 0.66
Median 3.07
Q3 4.52
95-th percentile 10.71
Maximum 17.48
Range 28.4
Interquartile range 3.86

Descriptive statistics

Standard deviation 5.0265
Coef of variation 1.8368
Kurtosis 0.89117
Mean 2.7365
MAD 3.5542
Skewness -0.31835
Sum 68098
Variance 25.265
Memory size 195.4 KiB
Value Count Frequency (%)  
10.71 1333 5.3%
 
6.9 1270 5.1%
 
4.48 1121 4.5%
 
3.98 1113 4.5%
 
1.85 850 3.4%
 
-5.01 773 3.1%
 
-9.31 757 3.0%
 
3.9 708 2.8%
 
-0.03 683 2.7%
 
4.33 666 2.7%
 
Other values (89) 15611 62.4%
 

Minimum 5 values

Value Count Frequency (%)  
-10.92 51 0.2%
 
-10.58 102 0.4%
 
-10.06 53 0.2%
 
-9.96 136 0.5%
 
-9.52 75 0.3%
 

Maximum 5 values

Value Count Frequency (%)  
10.71 1333 5.3%
 
13.05 106 0.4%
 
14.41 199 0.8%
 
17.08 225 0.9%
 
17.48 50 0.2%
 

category_return_2015
Numeric

Distinct count 99
Unique (%) 0.4%
Missing (%) 18.4%
Missing (n) 4601
Infinite (%) 0.0%
Infinite (n) 0
Mean -2.253
Minimum -34.98
Maximum 11.97
Zeros (%) 0.0%

Quantile statistics

Minimum -34.98
5-th percentile -13.79
Q1 -4.01
Median -1.69
Q3 -0.26
95-th percentile 3.6
Maximum 11.97
Range 46.95
Interquartile range 3.75

Descriptive statistics

Standard deviation 4.9993
Coef of variation -2.219
Kurtosis 12.296
Mean -2.253
MAD 2.9549
Skewness -2.667
Sum -45958
Variance 24.993
Memory size 195.4 KiB
Value Count Frequency (%)  
3.6 1144 4.6%
 
-1.07 1088 4.4%
 
-4.05 971 3.9%
 
-0.26 801 3.2%
 
-1.59 738 3.0%
 
-1.69 688 2.8%
 
-1.93 625 2.5%
 
-13.79 623 2.5%
 
-5.38 562 2.2%
 
-4.01 560 2.2%
 
Other values (88) 12599 50.4%
 
(Missing) 4601 18.4%
 

Minimum 5 values

Value Count Frequency (%)  
-34.98 88 0.4%
 
-29.95 12 0.0%
 
-27.39 58 0.2%
 
-23.99 92 0.4%
 
-23.25 53 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
4.15 12 0.0%
 
5.21 138 0.6%
 
7.05 100 0.4%
 
8.05 95 0.4%
 
11.97 22 0.1%
 

fund_return_3months
Highly correlated

This variable is highly correlated with ytd_return_fund and should be ignored for analysis

Correlation 0.97222

fund_return_3years
Numeric

Distinct count 2650
Unique (%) 10.6%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 6.9992
Minimum -36.02
Maximum 38.42
Zeros (%) 6.2%

Quantile statistics

Minimum -36.02
5-th percentile 0
Q1 2.81
Median 6.82
Q3 10.21
95-th percentile 16.16
Maximum 38.42
Range 74.44
Interquartile range 7.4

Descriptive statistics

Standard deviation 5.4604
Coef of variation 0.78016
Kurtosis 3.82
Mean 6.9992
MAD 4.1873
Skewness 0.065025
Sum 174170
Variance 29.816
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 1540 6.2%
 
7.12 35 0.1%
 
6.32 35 0.1%
 
1.65 33 0.1%
 
8.99 31 0.1%
 
1.67 31 0.1%
 
2.21 30 0.1%
 
7.17 30 0.1%
 
9.65 30 0.1%
 
8.5 30 0.1%
 
Other values (2639) 23060 92.2%
 
(Missing) 115 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-36.02 1 0.0%
 
-35.33 1 0.0%
 
-34.52 1 0.0%
 
-34.12 1 0.0%
 
-33.88 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
34.35 1 0.0%
 
34.57 1 0.0%
 
35.18 2 0.0%
 
37.04 1 0.0%
 
38.42 1 0.0%
 

greatstone_rating
Numeric

Distinct count 7
Unique (%) 0.0%
Missing (%) 20.0%
Missing (n) 5000
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.8397
Minimum 0
Maximum 5
Zeros (%) 5.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 2
Median 3
Q3 4
95-th percentile 5
Maximum 5
Range 5
Interquartile range 2

Descriptive statistics

Standard deviation 1.2774
Coef of variation 0.44984
Kurtosis -0.17408
Mean 2.8397
MAD 0.99599
Skewness -0.448
Sum 56795
Variance 1.6319
Memory size 195.4 KiB
Value Count Frequency (%)  
3.0 6786 27.1%
 
4.0 4614 18.5%
 
2.0 4230 16.9%
 
5.0 1629 6.5%
 
1.0 1376 5.5%
 
0.0 1365 5.5%
 
(Missing) 5000 20.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 1365 5.5%
 
1.0 1376 5.5%
 
2.0 4230 16.9%
 
3.0 6786 27.1%
 
4.0 4614 18.5%
 

Maximum 5 values

Value Count Frequency (%)  
1.0 1376 5.5%
 
2.0 4230 16.9%
 
3.0 6786 27.1%
 
4.0 4614 18.5%
 
5.0 1629 6.5%
 

mmc
Categorical

Distinct count 5689
Unique (%) 22.8%
Missing (%) 0.5%
Missing (n) 114
0
6008
828.01
 
75
2,193.13
 
41
Other values (5685)
18762
(Missing)
 
114
Value Count Frequency (%)  
0 6008 24.0%
 
828.01 75 0.3%
 
2,193.13 41 0.2%
 
9,234.14 34 0.1%
 
88,146.69 17 0.1%
 
95,232.43 17 0.1%
 
1,063.09 17 0.1%
 
43,954.74 17 0.1%
 
39,247.34 17 0.1%
 
23,042.48 17 0.1%
 
Other values (5678) 18626 74.5%
 
(Missing) 114 0.5%
 

pb_ratio
Numeric

Distinct count 604
Unique (%) 2.4%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.8543
Minimum 0
Maximum 123.3
Zeros (%) 24.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0.56
Median 1.85
Q3 2.38
95-th percentile 4.5
Maximum 123.3
Range 123.3
Interquartile range 1.82

Descriptive statistics

Standard deviation 2.9842
Coef of variation 1.6094
Kurtosis 1211.6
Mean 1.8543
MAD 1.1158
Skewness 30.129
Sum 46145
Variance 8.9057
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 6059 24.2%
 
2.0 235 0.9%
 
2.01 218 0.9%
 
1.94 181 0.7%
 
2.13 180 0.7%
 
1.96 173 0.7%
 
2.03 172 0.7%
 
1.92 170 0.7%
 
2.02 167 0.7%
 
1.99 158 0.6%
 
Other values (593) 17173 68.7%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 6059 24.2%
 
0.12 2 0.0%
 
0.26 7 0.0%
 
0.27 6 0.0%
 
0.29 5 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
10.77 1 0.0%
 
11.17 2 0.0%
 
14.07 4 0.0%
 
22.47 17 0.1%
 
123.3 11 0.0%
 

pc_ratio
Categorical

Distinct count 1584
Unique (%) 6.3%
Missing (%) 0.5%
Missing (n) 114
0
4144
0.0
 
1900
6.99
 
98
Other values (1580)
18744
(Missing)
 
114
Value Count Frequency (%)  
0 4144 16.6%
 
0.0 1900 7.6%
 
6.99 98 0.4%
 
7.18 98 0.4%
 
0.46 92 0.4%
 
7.21 81 0.3%
 
7.63 78 0.3%
 
7.54 77 0.3%
 
7.23 76 0.3%
 
7.96 76 0.3%
 
Other values (1573) 18166 72.7%
 
(Missing) 114 0.5%
 

pe_ratio
Categorical

Distinct count 1782
Unique (%) 7.1%
Missing (%) 0.5%
Missing (n) 114
0
4128
0.0
 
1910
3.65
 
92
Other values (1778)
18756
(Missing)
 
114
Value Count Frequency (%)  
0 4128 16.5%
 
0.0 1910 7.6%
 
3.65 92 0.4%
 
15.14 89 0.4%
 
15.37 87 0.3%
 
15.87 86 0.3%
 
17.05 69 0.3%
 
16.15 67 0.3%
 
16.57 66 0.3%
 
15.09 66 0.3%
 
Other values (1771) 18226 72.9%
 
(Missing) 114 0.5%
 

portfolio_convertable
Numeric

Distinct count 400
Unique (%) 1.6%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.57097
Minimum 0
Maximum 98.86
Zeros (%) 68.1%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0.07
95-th percentile 1.29
Maximum 98.86
Range 98.86
Interquartile range 0.07

Descriptive statistics

Standard deviation 4.8273
Coef of variation 8.4545
Kurtosis 231.33
Mean 0.57097
MAD 0.96172
Skewness 14.695
Sum 14209
Variance 23.303
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 17032 68.1%
 
0.01 309 1.2%
 
0.08 290 1.2%
 
0.02 275 1.1%
 
0.06 274 1.1%
 
0.05 249 1.0%
 
0.03 238 1.0%
 
0.1 233 0.9%
 
0.07 233 0.9%
 
0.09 221 0.9%
 
Other values (389) 5532 22.1%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 17032 68.1%
 
0.01 309 1.2%
 
0.02 275 1.1%
 
0.03 238 1.0%
 
0.04 207 0.8%
 

Maximum 5 values

Value Count Frequency (%)  
83.05 3 0.0%
 
83.23 5 0.0%
 
83.49 2 0.0%
 
83.89 3 0.0%
 
98.86 4 0.0%
 

portfolio_others
Numeric

Distinct count 723
Unique (%) 2.9%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.0558
Minimum 0
Maximum 98.84
Zeros (%) 56.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0.28
95-th percentile 5.12
Maximum 98.84
Range 98.84
Interquartile range 0.28

Descriptive statistics

Standard deviation 4.4478
Coef of variation 4.2128
Kurtosis 128.3
Mean 1.0558
MAD 1.6447
Skewness 9.7283
Sum 26274
Variance 19.783
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 14058 56.2%
 
0.04 492 2.0%
 
0.01 484 1.9%
 
0.02 374 1.5%
 
0.05 355 1.4%
 
0.03 353 1.4%
 
0.06 267 1.1%
 
0.08 216 0.9%
 
0.07 193 0.8%
 
0.09 180 0.7%
 
Other values (712) 7914 31.7%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 14058 56.2%
 
0.01 484 1.9%
 
0.02 374 1.5%
 
0.03 353 1.4%
 
0.04 492 2.0%
 

Maximum 5 values

Value Count Frequency (%)  
64.87 7 0.0%
 
66.96 1 0.0%
 
77.87 7 0.0%
 
93.57 4 0.0%
 
98.84 1 0.0%
 

portfolio_preferred
Numeric

Distinct count 380
Unique (%) 1.5%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.31252
Minimum 0
Maximum 80.87
Zeros (%) 73.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0.01
95-th percentile 0.9675
Maximum 80.87
Range 80.87
Interquartile range 0.01

Descriptive statistics

Standard deviation 2.1508
Coef of variation 6.8821
Kurtosis 452.32
Mean 0.31252
MAD 0.5342
Skewness 17.687
Sum 7777.3
Variance 4.6258
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 18290 73.2%
 
0.01 835 3.3%
 
0.02 467 1.9%
 
0.03 306 1.2%
 
0.1 212 0.8%
 
0.04 204 0.8%
 
0.05 200 0.8%
 
0.07 195 0.8%
 
0.08 178 0.7%
 
0.06 161 0.6%
 
Other values (369) 3838 15.4%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 18290 73.2%
 
0.01 835 3.3%
 
0.02 467 1.9%
 
0.03 306 1.2%
 
0.04 204 0.8%
 

Maximum 5 values

Value Count Frequency (%)  
45.86 2 0.0%
 
53.32 2 0.0%
 
56.65 4 0.0%
 
62.37 1 0.0%
 
80.87 3 0.0%
 

ps_ratio
Categorical

Distinct count 556
Unique (%) 2.2%
Missing (%) 0.5%
Missing (n) 114
0.0
3959
0
 
2026
1.49
 
273
Other values (552)
18628
Value Count Frequency (%)  
0.0 3959 15.8%
 
0 2026 8.1%
 
1.49 273 1.1%
 
1.47 252 1.0%
 
1.51 249 1.0%
 
1.45 238 1.0%
 
1.5 221 0.9%
 
0.99 193 0.8%
 
1.31 183 0.7%
 
1.54 180 0.7%
 
Other values (545) 17112 68.4%
 

stock_percent_of_portfolio
Numeric

Distinct count 2739
Unique (%) 11.0%
Missing (%) 0.5%
Missing (n) 114
Infinite (%) 0.0%
Infinite (n) 0
Mean 59.122
Minimum 0
Maximum 100
Zeros (%) 20.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0.5225
Median 82.65
Q3 97.6
95-th percentile 99.6
Maximum 100
Range 100
Interquartile range 97.078

Descriptive statistics

Standard deviation 42.251
Coef of variation 0.71464
Kurtosis -1.5795
Mean 59.122
MAD 39.172
Skewness -0.44936
Sum 1471300
Variance 1785.1
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 5145 20.6%
 
100.0 444 1.8%
 
0.01 144 0.6%
 
0.02 84 0.3%
 
0.03 84 0.3%
 
97.8 62 0.2%
 
99.05 61 0.2%
 
0.04 52 0.2%
 
98.78 51 0.2%
 
99.29 50 0.2%
 
Other values (2728) 18709 74.8%
 
(Missing) 114 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 5145 20.6%
 
0.01 144 0.6%
 
0.02 84 0.3%
 
0.03 84 0.3%
 
0.04 52 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
99.96 12 0.0%
 
99.97 14 0.1%
 
99.98 27 0.1%
 
99.99 34 0.1%
 
100.0 444 1.8%
 

tag
Numeric

Distinct count 25000
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 139880
Minimum 26000
Maximum 253763
Zeros (%) 0.0%

Quantile statistics

Minimum 26000
5-th percentile 37367
Q1 83022
Median 139880
Q3 196760
95-th percentile 242390
Maximum 253763
Range 227763
Interquartile range 113740

Descriptive statistics

Standard deviation 65731
Coef of variation 0.46992
Kurtosis -1.199
Mean 139880
MAD 56921
Skewness 6.0424e-05
Sum 3496973366
Variance 4320600000
Memory size 195.4 KiB
Value Count Frequency (%)  
165887 1 0.0%
 
193211 1 0.0%
 
86687 1 0.0%
 
174752 1 0.0%
 
41633 1 0.0%
 
144035 1 0.0%
 
232100 1 0.0%
 
98981 1 0.0%
 
39590 1 0.0%
 
201383 1 0.0%
 
Other values (24990) 24990 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
26000 1 0.0%
 
26009 1 0.0%
 
26018 1 0.0%
 
26027 1 0.0%
 
26036 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
253727 1 0.0%
 
253736 1 0.0%
 
253745 1 0.0%
 
253754 1 0.0%
 
253763 1 0.0%
 

years_down
Numeric

Distinct count 27
Unique (%) 0.1%
Missing (%) 6.6%
Missing (n) 1641
Infinite (%) 0.0%
Infinite (n) 0
Mean 3.2425
Minimum 1
Maximum 28
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 1
Median 3
Q3 4
95-th percentile 7
Maximum 28
Range 27
Interquartile range 3

Descriptive statistics

Standard deviation 2.3227
Coef of variation 0.71634
Kurtosis 8.241
Mean 3.2425
MAD 1.746
Skewness 1.9961
Sum 75742
Variance 5.3951
Memory size 195.4 KiB
Value Count Frequency (%)  
1.0 5945 23.8%
 
2.0 4950 19.8%
 
3.0 3752 15.0%
 
4.0 3286 13.1%
 
5.0 2162 8.6%
 
6.0 1330 5.3%
 
7.0 828 3.3%
 
8.0 472 1.9%
 
9.0 261 1.0%
 
10.0 110 0.4%
 
Other values (16) 263 1.1%
 
(Missing) 1641 6.6%
 

Minimum 5 values

Value Count Frequency (%)  
1.0 5945 23.8%
 
2.0 4950 19.8%
 
3.0 3752 15.0%
 
4.0 3286 13.1%
 
5.0 2162 8.6%
 

Maximum 5 values

Value Count Frequency (%)  
22.0 3 0.0%
 
23.0 2 0.0%
 
24.0 5 0.0%
 
26.0 1 0.0%
 
28.0 1 0.0%
 

years_up
Numeric

Distinct count 68
Unique (%) 0.3%
Missing (%) 7.2%
Missing (n) 1812
Infinite (%) 0.0%
Infinite (n) 0
Mean 8.4193
Minimum 1
Maximum 70
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 3
Median 7
Q3 12
95-th percentile 21
Maximum 70
Range 69
Interquartile range 9

Descriptive statistics

Standard deviation 6.9673
Coef of variation 0.82754
Kurtosis 6.8523
Mean 8.4193
MAD 5.303
Skewness 1.8537
Sum 195230
Variance 48.544
Memory size 195.4 KiB
Value Count Frequency (%)  
2.0 3100 12.4%
 
1.0 1896 7.6%
 
3.0 1718 6.9%
 
7.0 1598 6.4%
 
5.0 1598 6.4%
 
4.0 1537 6.1%
 
8.0 1302 5.2%
 
6.0 1296 5.2%
 
12.0 1016 4.1%
 
9.0 972 3.9%
 
Other values (57) 7155 28.6%
 
(Missing) 1812 7.2%
 

Minimum 5 values

Value Count Frequency (%)  
1.0 1896 7.6%
 
2.0 3100 12.4%
 
3.0 1718 6.9%
 
4.0 1537 6.1%
 
5.0 1598 6.4%
 

Maximum 5 values

Value Count Frequency (%)  
66.0 3 0.0%
 
67.0 1 0.0%
 
68.0 2 0.0%
 
69.0 1 0.0%
 
70.0 2 0.0%
 

ytd_return_category
Numeric

Distinct count 100
Unique (%) 0.4%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.1687
Minimum -17.38
Maximum 20.95
Zeros (%) 0.8%

Quantile statistics

Minimum -17.38
5-th percentile 1.71
Q1 5.02
Median 10.24
Q3 12.94
95-th percentile 17.01
Maximum 20.95
Range 38.33
Interquartile range 7.92

Descriptive statistics

Standard deviation 4.9997
Coef of variation 0.5453
Kurtosis 0.50549
Mean 9.1687
MAD 4.2093
Skewness -0.31774
Sum 228160
Variance 24.997
Memory size 195.4 KiB
Value Count Frequency (%)  
12.94 1655 6.6%
 
15.67 1333 5.3%
 
11.29 1121 4.5%
 
3.13 957 3.8%
 
12.27 875 3.5%
 
10.27 757 3.0%
 
8.89 708 2.8%
 
13.34 683 2.7%
 
10.24 680 2.7%
 
4.17 672 2.7%
 
Other values (89) 15444 61.8%
 

Minimum 5 values

Value Count Frequency (%)  
-17.38 53 0.2%
 
0.0 188 0.8%
 
0.12 142 0.6%
 
1.02 164 0.7%
 
1.04 93 0.4%
 

Maximum 5 values

Value Count Frequency (%)  
17.01 645 2.6%
 
18.19 578 2.3%
 
19.1 106 0.4%
 
19.73 160 0.6%
 
20.95 54 0.2%
 

ytd_return_fund
Numeric

Distinct count 2744
Unique (%) 11.0%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.2898
Minimum -36.3
Maximum 46.29
Zeros (%) 0.1%

Quantile statistics

Minimum -36.3
5-th percentile 1.37
Q1 4.43
Median 9.82
Q3 13.08
95-th percentile 18.32
Maximum 46.29
Range 82.59
Interquartile range 8.65

Descriptive statistics

Standard deviation 5.7977
Coef of variation 0.6241
Kurtosis 2.2354
Mean 9.2898
MAD 4.6294
Skewness -0.11739
Sum 231180
Variance 33.613
Memory size 195.4 KiB
Value Count Frequency (%)  
11.88 36 0.1%
 
2.45 36 0.1%
 
11.33 34 0.1%
 
10.94 34 0.1%
 
11.76 33 0.1%
 
2.76 32 0.1%
 
2.62 32 0.1%
 
11.21 31 0.1%
 
3.4 31 0.1%
 
10.27 31 0.1%
 
Other values (2733) 24555 98.2%
 
(Missing) 115 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-36.3 1 0.0%
 
-36.14 1 0.0%
 
-27.8 1 0.0%
 
-27.79 1 0.0%
 
-27.7 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
38.96 1 0.0%
 
41.33 1 0.0%
 
45.78 1 0.0%
 
45.88 1 0.0%
 
46.29 1 0.0%
 

Correlations

Sample

2014_category_return 2012_return_category years_up 2018_return_category tag category_return_1year cash_percent_of_portfolio pc_ratio 2011_return_category ytd_return_fund years_down 2014_return_fund category_return_1month 2013_return_fund fund_return_3months ytd_return_category pb_ratio 2017_category_return 1_year_return_fund pe_ratio 2015_return_fund portfolio_convertable 3_months_return_category portfolio_others 2016_return_fund mmc stock_percent_of_portfolio 2016_return_category ps_ratio 2011_return_fund 2010_return_fund fund_return_3years 2012_fund_return 2018_return_fund 2017_return_fund greatstone_rating category_ratio_net_annual_expense category_return_2015 1_month_fund_return bond_percentage_of_porfolio portfolio_preferred 2010_return_category 2013_category_return
0 NaN NaN 1.0 -16.32 67922 13.05 1.19 5.91 NaN 20.19 2.0 NaN 4.20 NaN 20.19 19.10 1.71 -5.78 18.40 14.51 NaN 0.00 19.10 0.00 16.14 19,857.41 98.81 27.30 1.31 NaN NaN 4.24 NaN -12.23 -3.31 NaN 1.75 -34.98 4.12 0.00 0.00 NaN NaN
1 10.00 15.34 5.0 -2.09 134783 10.71 0.10 15.95 NaN 16.79 1.0 14.25 2.12 35.46 16.79 15.67 5.30 27.67 12.18 18.88 5.60 0.00 15.67 0.00 1.64 72,347.03 99.90 3.23 3.38 NaN NaN 14.39 NaN -2.62 26.39 3.0 1.06 3.60 2.33 0.00 0.00 NaN 33.92
2 10.00 15.34 26.0 -2.09 61271 10.71 2.00 15.97 -2.46 17.13 5.0 11.04 2.12 30.42 17.13 15.67 5.40 27.67 19.77 23.27 3.68 0.00 15.67 0.22 2.32 68,857.43 97.12 3.23 3.67 -2.23 17.23 16.42 15.52 5.04 25.79 4.0 1.06 3.60 3.77 0.58 0.08 15.53 33.92
3 10.21 14.57 11.0 -8.53 64412 4.48 6.13 8.93 -0.75 11.63 2.0 12.32 0.46 29.31 11.63 11.29 2.23 15.94 7.11 12.7 2.09 0.00 11.29 0.00 14.66 43,266.62 93.87 14.81 1.63 0.08 15.63 6.85 17.66 -7.54 8.53 3.0 1.00 -4.05 1.46 0.00 0.00 13.66 31.21
4 NaN NaN 1.0 -7.04 184058 3.17 6.59 7.59 NaN 10.25 1.0 NaN 1.28 NaN 10.25 10.36 2.02 18.43 3.11 14.74 NaN 0.09 10.36 0.80 NaN 43,747.9 67.41 NaN 1.4 NaN NaN 0.00 NaN -7.37 17.52 0.0 0.45 NaN 1.28 24.97 0.02 NaN NaN
In [13]:
#return_3years contains 17 columns which give information about 3 year return and ratios
return_3year = pd.read_csv('Hackathon_Files/external/return_3year.csv')
pandas_profiling.ProfileReport(return_3year)
Out[13]:

Overview

Dataset info

Number of variables 17
Number of observations 25000
Total Missing (%) 2.9%
Total size in memory 3.2 MiB
Average record size in memory 136.0 B

Variables types

Numeric 15
Categorical 1
Boolean 0
Date 0
Text (Unique) 0
Rejected 1
Unsupported 0

Warnings

Variables

3_years_alpha_category
Numeric

Distinct count 17
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean -0.004579
Minimum -0.12
Maximum 0.11
Zeros (%) 21.6%

Quantile statistics

Minimum -0.12
5-th percentile -0.05
Q1 -0.01
Median -0.01
Q3 0
95-th percentile 0.03
Maximum 0.11
Range 0.23
Interquartile range 0.01

Descriptive statistics

Standard deviation 0.023468
Coef of variation -5.1252
Kurtosis 2.9032
Mean -0.004579
MAD 0.016484
Skewness 0.039778
Sum -113.99
Variance 0.00055076
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.01 7323 29.3%
 
-0.0 5394 21.6%
 
-0.02 3339 13.4%
 
0.01 2609 10.4%
 
0.03 1700 6.8%
 
-0.05 1071 4.3%
 
0.02 741 3.0%
 
0.05 714 2.9%
 
-0.04 563 2.3%
 
-0.03 525 2.1%
 
Other values (6) 915 3.7%
 

Minimum 5 values

Value Count Frequency (%)  
-0.12 77 0.3%
 
-0.06 418 1.7%
 
-0.05 1071 4.3%
 
-0.04 563 2.3%
 
-0.03 525 2.1%
 

Maximum 5 values

Value Count Frequency (%)  
0.04 230 0.9%
 
0.05 714 2.9%
 
0.06 15 0.1%
 
0.08 160 0.6%
 
0.11 15 0.1%
 

3_years_alpha_fund
Numeric

Distinct count 2088
Unique (%) 8.4%
Missing (%) 6.6%
Missing (n) 1648
Infinite (%) 0.0%
Infinite (n) 0
Mean -0.57702
Minimum -36.24
Maximum 19.15
Zeros (%) 0.2%

Quantile statistics

Minimum -36.24
5-th percentile -6.1945
Q1 -2.1
Median -0.59
Q3 0.89
95-th percentile 5.07
Maximum 19.15
Range 55.39
Interquartile range 2.99

Descriptive statistics

Standard deviation 3.3798
Coef of variation -5.8574
Kurtosis 5.122
Mean -0.57702
MAD 2.3343
Skewness -0.32443
Sum -13475
Variance 11.423
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.88 73 0.3%
 
-0.38 69 0.3%
 
-0.72 69 0.3%
 
-0.42 69 0.3%
 
-0.7 67 0.3%
 
-0.53 67 0.3%
 
-0.28 66 0.3%
 
-0.46 65 0.3%
 
-0.4 65 0.3%
 
-0.96 65 0.3%
 
Other values (2077) 22677 90.7%
 
(Missing) 1648 6.6%
 

Minimum 5 values

Value Count Frequency (%)  
-36.24 1 0.0%
 
-35.25 1 0.0%
 
-33.59 1 0.0%
 
-30.89 1 0.0%
 
-29.83 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
17.5 1 0.0%
 
18.15 1 0.0%
 
18.55 1 0.0%
 
18.82 1 0.0%
 
19.15 1 0.0%
 

3_years_return_category
Numeric

Distinct count 103
Unique (%) 0.4%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.4618
Minimum -19.79
Maximum 21.78
Zeros (%) 0.8%

Quantile statistics

Minimum -19.79
5-th percentile 1.62
Q1 4.36
Median 7.44
Q3 10.01
95-th percentile 15.35
Maximum 21.78
Range 41.57
Interquartile range 5.65

Descriptive statistics

Standard deviation 4.4433
Coef of variation 0.59547
Kurtosis 2.4665
Mean 7.4618
MAD 3.5023
Skewness -0.12176
Sum 185690
Variance 19.743
Memory size 195.4 KiB
Value Count Frequency (%)  
15.35 1333 5.3%
 
11.84 1270 5.1%
 
10.01 1121 4.5%
 
2.37 957 3.8%
 
9.96 850 3.4%
 
9.11 757 3.0%
 
7.44 708 2.8%
 
10.17 683 2.7%
 
6.62 680 2.7%
 
6.97 666 2.7%
 
Other values (92) 15860 63.4%
 

Minimum 5 values

Value Count Frequency (%)  
-19.79 53 0.2%
 
-2.14 114 0.5%
 
-0.05 75 0.3%
 
0.0 188 0.8%
 
0.71 142 0.6%
 

Maximum 5 values

Value Count Frequency (%)  
14.18 578 2.3%
 
15.35 1333 5.3%
 
15.88 645 2.6%
 
16.77 15 0.1%
 
21.78 160 0.6%
 

3_years_return_mean_annual_category
Numeric

Distinct count 5
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.006514
Minimum -0.02
Maximum 0.02
Zeros (%) 34.8%

Quantile statistics

Minimum -0.02
5-th percentile 0
Q1 0
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.02
Range 0.04
Interquartile range 0.01

Descriptive statistics

Standard deviation 0.0050391
Coef of variation 0.77357
Kurtosis 0.11524
Mean 0.006514
MAD 0.0046628
Skewness -0.71749
Sum 162.16
Variance 2.5392e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 15972 63.9%
 
0.0 8694 34.8%
 
0.02 175 0.7%
 
-0.02 53 0.2%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.02 53 0.2%
 
0.0 8694 34.8%
 
0.01 15972 63.9%
 
0.02 175 0.7%
 

Maximum 5 values

Value Count Frequency (%)  
-0.02 53 0.2%
 
0.0 8694 34.8%
 
0.01 15972 63.9%
 
0.02 175 0.7%
 

3_years_return_mean_annual_fund
Numeric

Distinct count 389
Unique (%) 1.6%
Missing (%) 6.6%
Missing (n) 1648
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.63623
Minimum -3.19
Maximum 2.98
Zeros (%) 0.1%

Quantile statistics

Minimum -3.19
5-th percentile 0.09
Q1 0.33
Median 0.62
Q3 0.89
95-th percentile 1.36
Maximum 2.98
Range 6.17
Interquartile range 0.56

Descriptive statistics

Standard deviation 0.43605
Coef of variation 0.68536
Kurtosis 5.1738
Mean 0.63623
MAD 0.33086
Skewness -0.1965
Sum 14857
Variance 0.19014
Memory size 195.4 KiB
Value Count Frequency (%)  
0.14 266 1.1%
 
0.8 265 1.1%
 
0.63 255 1.0%
 
0.62 255 1.0%
 
0.17 252 1.0%
 
0.15 249 1.0%
 
0.76 249 1.0%
 
0.18 245 1.0%
 
0.19 243 1.0%
 
0.2 243 1.0%
 
Other values (378) 20830 83.3%
 
(Missing) 1648 6.6%
 

Minimum 5 values

Value Count Frequency (%)  
-3.19 1 0.0%
 
-3.14 1 0.0%
 
-3.11 1 0.0%
 
-3.09 2 0.0%
 
-2.92 3 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
2.81 1 0.0%
 
2.82 1 0.0%
 
2.86 2 0.0%
 
2.9 1 0.0%
 
2.98 1 0.0%
 

3years_category_r_squared
Numeric

Distinct count 58
Unique (%) 0.2%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.71632
Minimum 0
Maximum 0.97
Zeros (%) 0.7%

Quantile statistics

Minimum 0
5-th percentile 0.08
Q1 0.63
Median 0.81
Q3 0.89
95-th percentile 0.96
Maximum 0.97
Range 0.97
Interquartile range 0.26

Descriptive statistics

Standard deviation 0.25094
Coef of variation 0.35032
Kurtosis 1.2269
Mean 0.71632
MAD 0.1909
Skewness -1.4381
Sum 17832
Variance 0.06297
Memory size 195.4 KiB
Value Count Frequency (%)  
0.84 2637 10.5%
 
0.88 1659 6.6%
 
0.95 1488 6.0%
 
0.92 1485 5.9%
 
0.81 1395 5.6%
 
0.67 1003 4.0%
 
0.89 959 3.8%
 
0.04 945 3.8%
 
0.6 929 3.7%
 
0.96 877 3.5%
 
Other values (47) 11517 46.1%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 187 0.7%
 
0.01 30 0.1%
 
0.03 57 0.2%
 
0.04 945 3.8%
 
0.08 142 0.6%
 

Maximum 5 values

Value Count Frequency (%)  
0.93 58 0.2%
 
0.94 419 1.7%
 
0.95 1488 6.0%
 
0.96 877 3.5%
 
0.97 572 2.3%
 

3years_category_std
Numeric

Distinct count 23
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.088854
Minimum 0
Maximum 0.33
Zeros (%) 1.4%

Quantile statistics

Minimum 0
5-th percentile 0.03
Q1 0.04
Median 0.09
Q3 0.13
95-th percentile 0.16
Maximum 0.33
Range 0.33
Interquartile range 0.09

Descriptive statistics

Standard deviation 0.047886
Coef of variation 0.53893
Kurtosis 0.46609
Mean 0.088854
MAD 0.040696
Skewness 0.32364
Sum 2211.9
Variance 0.0022931
Memory size 195.4 KiB
Value Count Frequency (%)  
0.11 4583 18.3%
 
0.13 3456 13.8%
 
0.03 3321 13.3%
 
0.16 2100 8.4%
 
0.04 2057 8.2%
 
0.05 1584 6.3%
 
0.07 1546 6.2%
 
0.09 1537 6.1%
 
0.08 797 3.2%
 
0.01 795 3.2%
 
Other values (12) 3118 12.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 352 1.4%
 
0.01 795 3.2%
 
0.02 39 0.2%
 
0.03 3321 13.3%
 
0.04 2057 8.2%
 

Maximum 5 values

Value Count Frequency (%)  
0.18 222 0.9%
 
0.2 53 0.2%
 
0.24 77 0.3%
 
0.27 15 0.1%
 
0.33 57 0.2%
 

3years_fund_r_squared
Numeric

Distinct count 6897
Unique (%) 27.6%
Missing (%) 6.6%
Missing (n) 1648
Infinite (%) 0.0%
Infinite (n) 0
Mean 72.558
Minimum 0
Maximum 100
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 4.53
Q1 64.24
Median 81.91
Q3 92.7
95-th percentile 97.95
Maximum 100
Range 100
Interquartile range 28.46

Descriptive statistics

Standard deviation 27.191
Coef of variation 0.37475
Kurtosis 0.88892
Mean 72.558
MAD 20.934
Skewness -1.3644
Sum 1694400
Variance 739.36
Memory size 195.4 KiB
Value Count Frequency (%)  
100.0 65 0.3%
 
99.99 31 0.1%
 
97.42 23 0.1%
 
97.34 22 0.1%
 
97.01 21 0.1%
 
97.68 21 0.1%
 
95.52 21 0.1%
 
96.26 21 0.1%
 
96.31 20 0.1%
 
97.08 20 0.1%
 
Other values (6886) 23087 92.3%
 
(Missing) 1648 6.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 12 0.0%
 
0.01 12 0.0%
 
0.02 9 0.0%
 
0.03 5 0.0%
 
0.04 11 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
99.96 5 0.0%
 
99.97 12 0.0%
 
99.98 13 0.1%
 
99.99 31 0.1%
 
100.0 65 0.3%
 

3years_fund_std
Numeric

Distinct count 2194
Unique (%) 8.8%
Missing (%) 6.6%
Missing (n) 1648
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.053
Minimum 0.18
Maximum 50.49
Zeros (%) 0.0%

Quantile statistics

Minimum 0.18
5-th percentile 2.0155
Q1 4.3
Median 9.66
Q3 12.42
95-th percentile 16.5
Maximum 50.49
Range 50.31
Interquartile range 8.12

Descriptive statistics

Standard deviation 5.1263
Coef of variation 0.56625
Kurtosis 2.0319
Mean 9.053
MAD 4.197
Skewness 0.6855
Sum 211400
Variance 26.279
Memory size 195.4 KiB
Value Count Frequency (%)  
2.86 74 0.3%
 
2.88 59 0.2%
 
2.92 54 0.2%
 
2.94 53 0.2%
 
2.85 51 0.2%
 
2.93 51 0.2%
 
10.74 51 0.2%
 
3.02 46 0.2%
 
2.96 45 0.2%
 
3.08 44 0.2%
 
Other values (2183) 22824 91.3%
 
(Missing) 1648 6.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.18 2 0.0%
 
0.22 4 0.0%
 
0.23 1 0.0%
 
0.24 3 0.0%
 
0.25 6 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
48.88 1 0.0%
 
49.56 1 0.0%
 
49.57 1 0.0%
 
50.44 1 0.0%
 
50.49 1 0.0%
 

3yrs_sharpe_ratio_category
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.0071547
Minimum -0.01
Maximum 0.01
Zeros (%) 27.2%

Quantile statistics

Minimum -0.01
5-th percentile 0
Q1 0
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.01
Range 0.02
Interquartile range 0.01

Descriptive statistics

Standard deviation 0.004641
Coef of variation 0.64866
Kurtosis -0.25496
Mean 0.0071547
MAD 0.004105
Skewness -1.1313
Sum 178.11
Variance 2.1539e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 17958 71.8%
 
0.0 6789 27.2%
 
-0.01 147 0.6%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.01 147 0.6%
 
0.0 6789 27.2%
 
0.01 17958 71.8%
 

Maximum 5 values

Value Count Frequency (%)  
-0.01 147 0.6%
 
0.0 6789 27.2%
 
0.01 17958 71.8%
 

3yrs_sharpe_ratio_fund
Numeric

Distinct count 434
Unique (%) 1.7%
Missing (%) 6.6%
Missing (n) 1648
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.68227
Minimum -4.39
Maximum 4.16
Zeros (%) 0.2%

Quantile statistics

Minimum -4.39
5-th percentile -0.08
Q1 0.44
Median 0.74
Q3 0.97
95-th percentile 1.3
Maximum 4.16
Range 8.55
Interquartile range 0.53

Descriptive statistics

Standard deviation 0.4626
Coef of variation 0.67802
Kurtosis 4.7869
Mean 0.68227
MAD 0.34264
Skewness -0.77384
Sum 15932
Variance 0.21399
Memory size 195.4 KiB
Value Count Frequency (%)  
0.92 335 1.3%
 
0.94 329 1.3%
 
0.88 293 1.2%
 
0.84 290 1.2%
 
0.9 287 1.1%
 
0.89 281 1.1%
 
0.96 279 1.1%
 
0.86 279 1.1%
 
0.87 276 1.1%
 
0.98 270 1.1%
 
Other values (423) 20433 81.7%
 
(Missing) 1648 6.6%
 

Minimum 5 values

Value Count Frequency (%)  
-4.39 1 0.0%
 
-3.19 1 0.0%
 
-2.85 1 0.0%
 
-2.81 1 0.0%
 
-2.72 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
3.28 1 0.0%
 
3.75 1 0.0%
 
3.77 1 0.0%
 
3.78 1 0.0%
 
4.16 1 0.0%
 

3yrs_treynor_ratio_category
Numeric

Distinct count 29
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.069803
Minimum -0.76
Maximum 0.3
Zeros (%) 4.3%

Quantile statistics

Minimum -0.76
5-th percentile 0
Q1 0.05
Median 0.06
Q3 0.1
95-th percentile 0.18
Maximum 0.3
Range 1.06
Interquartile range 0.05

Descriptive statistics

Standard deviation 0.068808
Coef of variation 0.98575
Kurtosis 9.5101
Mean 0.069803
MAD 0.042991
Skewness -0.21323
Sum 1737.7
Variance 0.0047346
Memory size 195.4 KiB
Value Count Frequency (%)  
0.06 5117 20.5%
 
0.05 3018 12.1%
 
0.01 2419 9.7%
 
0.08 2225 8.9%
 
0.11 1855 7.4%
 
0.13 1333 5.3%
 
0.09 1303 5.2%
 
0.03 1240 5.0%
 
0.1 1225 4.9%
 
-0.0 1068 4.3%
 
Other values (18) 4091 16.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.76 2 0.0%
 
-0.28 230 0.9%
 
-0.16 51 0.2%
 
-0.05 113 0.5%
 
-0.03 94 0.4%
 

Maximum 5 values

Value Count Frequency (%)  
0.18 15 0.1%
 
0.19 218 0.9%
 
0.25 302 1.2%
 
0.26 50 0.2%
 
0.3 664 2.7%
 

3yrs_treynor_ratio_fund
Categorical

Distinct count 3470
Unique (%) 13.9%
Missing (%) 6.6%
Missing (n) 1648
5.5
 
46
5.96
 
45
6.2
 
45
Other values (3466)
23216
(Missing)
 
1648
Value Count Frequency (%)  
5.5 46 0.2%
 
5.96 45 0.2%
 
6.2 45 0.2%
 
6.1 43 0.2%
 
5.53 41 0.2%
 
6.12 40 0.2%
 
5.72 40 0.2%
 
5.7 40 0.2%
 
5.97 39 0.2%
 
5.87 39 0.2%
 
Other values (3459) 22934 91.7%
 
(Missing) 1648 6.6%
 

category_beta_3years
Numeric

Distinct count 6
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.008782
Minimum -0.01
Maximum 0.03
Zeros (%) 12.8%

Quantile statistics

Minimum -0.01
5-th percentile 0
Q1 0.01
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.03
Range 0.04
Interquartile range 0

Descriptive statistics

Standard deviation 0.0036776
Coef of variation 0.41876
Kurtosis 4.5323
Mean 0.008782
MAD 0.0023377
Skewness -1.5548
Sum 218.62
Variance 1.3525e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 21392 85.6%
 
0.0 3200 12.8%
 
0.02 224 0.9%
 
-0.01 53 0.2%
 
0.03 25 0.1%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.01 53 0.2%
 
0.0 3200 12.8%
 
0.01 21392 85.6%
 
0.02 224 0.9%
 
0.03 25 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
-0.01 53 0.2%
 
0.0 3200 12.8%
 
0.01 21392 85.6%
 
0.02 224 0.9%
 
0.03 25 0.1%
 

fund_beta_3years
Numeric

Distinct count 357
Unique (%) 1.4%
Missing (%) 6.6%
Missing (n) 1648
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.91025
Minimum -39.66
Maximum 22.57
Zeros (%) 0.1%

Quantile statistics

Minimum -39.66
5-th percentile 0.17
Q1 0.77
Median 0.98
Q3 1.14
95-th percentile 1.44
Maximum 22.57
Range 62.23
Interquartile range 0.37

Descriptive statistics

Standard deviation 0.63713
Coef of variation 0.69995
Kurtosis 1505.1
Mean 0.91025
MAD 0.29819
Skewness -20.832
Sum 21256
Variance 0.40593
Memory size 195.4 KiB
Value Count Frequency (%)  
1.0 581 2.3%
 
1.03 524 2.1%
 
0.98 495 2.0%
 
1.02 494 2.0%
 
0.96 425 1.7%
 
1.04 409 1.6%
 
1.08 406 1.6%
 
1.01 377 1.5%
 
1.1 376 1.5%
 
1.06 374 1.5%
 
Other values (346) 18891 75.6%
 
(Missing) 1648 6.6%
 

Minimum 5 values

Value Count Frequency (%)  
-39.66 1 0.0%
 
-39.59 1 0.0%
 
-11.46 1 0.0%
 
-11.43 2 0.0%
 
-6.72 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
11.38 1 0.0%
 
12.8 1 0.0%
 
12.81 1 0.0%
 
12.82 1 0.0%
 
22.57 1 0.0%
 

fund_return_3years
Highly correlated

This variable is highly correlated with 3_years_return_mean_annual_fund and should be ignored for analysis

Correlation 0.99472

tag
Numeric

Distinct count 25000
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 139880
Minimum 26000
Maximum 253763
Zeros (%) 0.0%

Quantile statistics

Minimum 26000
5-th percentile 37367
Q1 83022
Median 139880
Q3 196760
95-th percentile 242390
Maximum 253763
Range 227763
Interquartile range 113740

Descriptive statistics

Standard deviation 65731
Coef of variation 0.46992
Kurtosis -1.199
Mean 139880
MAD 56921
Skewness 6.0424e-05
Sum 3496973366
Variance 4320600000
Memory size 195.4 KiB
Value Count Frequency (%)  
165887 1 0.0%
 
193211 1 0.0%
 
86687 1 0.0%
 
174752 1 0.0%
 
41633 1 0.0%
 
144035 1 0.0%
 
232100 1 0.0%
 
98981 1 0.0%
 
39590 1 0.0%
 
201383 1 0.0%
 
Other values (24990) 24990 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
26000 1 0.0%
 
26009 1 0.0%
 
26018 1 0.0%
 
26027 1 0.0%
 
26036 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
253727 1 0.0%
 
253736 1 0.0%
 
253745 1 0.0%
 
253754 1 0.0%
 
253763 1 0.0%
 

Correlations

Sample

tag 3yrs_treynor_ratio_fund 3_years_alpha_fund 3years_category_std 3yrs_sharpe_ratio_fund 3yrs_treynor_ratio_category 3_years_return_mean_annual_fund fund_beta_3years 3years_fund_r_squared 3years_fund_std category_beta_3years fund_return_3years 3_years_alpha_category 3_years_return_mean_annual_category 3yrs_sharpe_ratio_category 3years_category_r_squared 3_years_return_category
0 67922 2.46 -7.10 0.18 0.26 0.05 0.45 1.20 54.83 16.25 0.01 4.24 -0.04 0.01 0.00 0.42 7.36
1 134783 12.2 0.07 0.13 1.06 0.13 1.19 1.07 88.46 12.26 0.01 14.39 0.01 0.01 0.01 0.84 15.35
2 61271 17.88 4.32 0.13 1.46 0.13 1.32 0.85 84.41 9.93 0.01 16.42 0.01 0.01 0.01 0.84 15.35
3 64412 7.93 -2.73 0.11 0.68 0.09 0.58 0.70 81.02 8.36 0.01 6.85 -0.02 0.01 0.01 0.84 10.01
4 184058 NaN NaN 0.08 NaN 0.06 NaN NaN NaN NaN 0.01 0.00 -0.01 0.01 0.01 0.97 9.13
In [14]:
#return_5years contains 17 columns which give information about 5 year return and ratios
return_5year = pd.read_csv('Hackathon_Files/external/return_5year.csv')
pandas_profiling.ProfileReport(return_5year)
Out[14]:

Overview

Dataset info

Number of variables 17
Number of observations 25000
Total Missing (%) 6.5%
Total size in memory 3.2 MiB
Average record size in memory 136.0 B

Variables types

Numeric 15
Categorical 1
Boolean 0
Date 0
Text (Unique) 0
Rejected 1
Unsupported 0

Warnings

Variables

5_years_alpha_category
Numeric

Distinct count 19
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean -0.0081208
Minimum -0.18
Maximum 0.08
Zeros (%) 30.0%

Quantile statistics

Minimum -0.18
5-th percentile -0.06
Q1 -0.02
Median 0
Q3 0
95-th percentile 0.04
Maximum 0.08
Range 0.26
Interquartile range 0.02

Descriptive statistics

Standard deviation 0.026415
Coef of variation -3.2527
Kurtosis 6.5381
Mean -0.0081208
MAD 0.017991
Skewness -1.0094
Sum -202.16
Variance 0.00069773
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.0 7500 30.0%
 
-0.01 4837 19.3%
 
-0.03 3107 12.4%
 
0.01 2300 9.2%
 
-0.02 2163 8.7%
 
0.04 1176 4.7%
 
0.02 1030 4.1%
 
-0.06 686 2.7%
 
-0.07 418 1.7%
 
-0.04 404 1.6%
 
Other values (8) 1273 5.1%
 

Minimum 5 values

Value Count Frequency (%)  
-0.18 77 0.3%
 
-0.11 106 0.4%
 
-0.08 93 0.4%
 
-0.07 418 1.7%
 
-0.06 686 2.7%
 

Maximum 5 values

Value Count Frequency (%)  
0.03 303 1.2%
 
0.04 1176 4.7%
 
0.05 30 0.1%
 
0.06 104 0.4%
 
0.08 175 0.7%
 

5_years_alpha_fund
Numeric

Distinct count 2017
Unique (%) 8.1%
Missing (%) 15.4%
Missing (n) 3843
Infinite (%) 0.0%
Infinite (n) 0
Mean -0.83676
Minimum -34.57
Maximum 15.05
Zeros (%) 0.2%

Quantile statistics

Minimum -34.57
5-th percentile -6.39
Q1 -2.12
Median -0.49
Q3 0.7
95-th percentile 3.84
Maximum 15.05
Range 49.62
Interquartile range 2.82

Descriptive statistics

Standard deviation 3.3011
Coef of variation -3.9451
Kurtosis 7.3809
Mean -0.83676
MAD 2.2429
Skewness -1.1349
Sum -17703
Variance 10.897
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.16 69 0.3%
 
-0.12 66 0.3%
 
-0.2 64 0.3%
 
-0.06 62 0.2%
 
0.16 62 0.2%
 
-0.55 61 0.2%
 
-0.18 60 0.2%
 
-0.1 60 0.2%
 
-0.34 60 0.2%
 
-0.38 60 0.2%
 
Other values (2006) 20533 82.1%
 
(Missing) 3843 15.4%
 

Minimum 5 values

Value Count Frequency (%)  
-34.57 1 0.0%
 
-33.59 1 0.0%
 
-30.58 1 0.0%
 
-30.22 1 0.0%
 
-29.84 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
13.32 1 0.0%
 
14.02 1 0.0%
 
14.82 1 0.0%
 
14.96 1 0.0%
 
15.05 1 0.0%
 

5_years_beta_category
Numeric

Distinct count 6
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.0086101
Minimum -0.02
Maximum 0.03
Zeros (%) 14.4%

Quantile statistics

Minimum -0.02
5-th percentile 0
Q1 0.01
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.03
Range 0.05
Interquartile range 0

Descriptive statistics

Standard deviation 0.0040121
Coef of variation 0.46598
Kurtosis 7.8208
Mean 0.0086101
MAD 0.0026142
Skewness -1.6648
Sum 214.34
Variance 1.6097e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 20989 84.0%
 
0.0 3603 14.4%
 
0.02 196 0.8%
 
-0.02 53 0.2%
 
0.03 53 0.2%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.02 53 0.2%
 
0.0 3603 14.4%
 
0.01 20989 84.0%
 
0.02 196 0.8%
 
0.03 53 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
-0.02 53 0.2%
 
0.0 3603 14.4%
 
0.01 20989 84.0%
 
0.02 196 0.8%
 
0.03 53 0.2%
 

5_years_beta_fund
Numeric

Distinct count 340
Unique (%) 1.4%
Missing (%) 15.4%
Missing (n) 3843
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.89786
Minimum -38.85
Maximum 24.72
Zeros (%) 0.1%

Quantile statistics

Minimum -38.85
5-th percentile 0.18
Q1 0.77
Median 0.97
Q3 1.1
95-th percentile 1.4
Maximum 24.72
Range 63.57
Interquartile range 0.33

Descriptive statistics

Standard deviation 0.6422
Coef of variation 0.71526
Kurtosis 1514.9
Mean 0.89786
MAD 0.27947
Skewness -19.761
Sum 18996
Variance 0.41242
Memory size 195.4 KiB
Value Count Frequency (%)  
1.0 599 2.4%
 
1.02 517 2.1%
 
1.06 478 1.9%
 
1.04 437 1.7%
 
0.92 429 1.7%
 
0.99 424 1.7%
 
0.96 423 1.7%
 
0.94 419 1.7%
 
1.08 405 1.6%
 
1.01 403 1.6%
 
Other values (329) 16623 66.5%
 
(Missing) 3843 15.4%
 

Minimum 5 values

Value Count Frequency (%)  
-38.85 1 0.0%
 
-38.77 1 0.0%
 
-9.98 1 0.0%
 
-9.92 1 0.0%
 
-9.89 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
11.68 1 0.0%
 
11.7 1 0.0%
 
11.73 1 0.0%
 
15.21 1 0.0%
 
24.72 1 0.0%
 

5_years_return_category
Numeric

Distinct count 100
Unique (%) 0.4%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.7573
Minimum -17
Maximum 15.26
Zeros (%) 0.8%

Quantile statistics

Minimum -17
5-th percentile 0.87
Q1 2.61
Median 4.23
Q3 6.41
95-th percentile 11.26
Maximum 15.26
Range 32.26
Interquartile range 3.8

Descriptive statistics

Standard deviation 3.4094
Coef of variation 0.71666
Kurtosis 4.9594
Mean 4.7573
MAD 2.5855
Skewness -0.63204
Sum 118390
Variance 11.624
Memory size 195.4 KiB
Value Count Frequency (%)  
11.26 1333 5.3%
 
8.91 1270 5.1%
 
7.2 1121 4.5%
 
2.51 957 3.8%
 
5.89 926 3.7%
 
3.45 854 3.4%
 
2.61 757 3.0%
 
5.12 708 2.8%
 
5.62 683 2.7%
 
2.1 680 2.7%
 
Other values (89) 15596 62.4%
 

Minimum 5 values

Value Count Frequency (%)  
-17.0 53 0.2%
 
-11.01 75 0.3%
 
-8.35 109 0.4%
 
-4.25 106 0.4%
 
-2.08 2 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
8.99 578 2.3%
 
9.53 199 0.8%
 
10.02 101 0.4%
 
11.26 1333 5.3%
 
15.26 160 0.6%
 

5_years_return_fund
Highly correlated

This variable is highly correlated with 5_years_return_mean_annual_fund and should be ignored for analysis

Correlation 0.98935

5_years_return_mean_annual_category
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.0038949
Minimum -0.01
Maximum 0.01
Zeros (%) 58.9%

Quantile statistics

Minimum -0.01
5-th percentile 0
Q1 0
Median 0
Q3 0.01
95-th percentile 0.01
Maximum 0.01
Range 0.02
Interquartile range 0.01

Descriptive statistics

Standard deviation 0.0050687
Coef of variation 1.3014
Kurtosis -1.414
Mean 0.0038949
MAD 0.0048725
Skewness 0.23203
Sum 96.96
Variance 2.5692e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.0 14722 58.9%
 
0.01 9934 39.7%
 
-0.01 238 1.0%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.01 238 1.0%
 
-0.0 14722 58.9%
 
0.01 9934 39.7%
 

Maximum 5 values

Value Count Frequency (%)  
-0.01 238 1.0%
 
-0.0 14722 58.9%
 
0.01 9934 39.7%
 

5_years_return_mean_annual_fund
Numeric

Distinct count 329
Unique (%) 1.3%
Missing (%) 15.4%
Missing (n) 3843
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.43748
Minimum -2.96
Maximum 2.49
Zeros (%) 0.1%

Quantile statistics

Minimum -2.96
5-th percentile 0.03
Q1 0.23
Median 0.41
Q3 0.63
95-th percentile 1.01
Maximum 2.49
Range 5.45
Interquartile range 0.4

Descriptive statistics

Standard deviation 0.34125
Coef of variation 0.78003
Kurtosis 7.5068
Mean 0.43748
MAD 0.25249
Skewness -0.62923
Sum 9255.7
Variance 0.11645
Memory size 195.4 KiB
Value Count Frequency (%)  
0.23 348 1.4%
 
0.24 348 1.4%
 
0.3 341 1.4%
 
0.22 334 1.3%
 
0.29 334 1.3%
 
0.25 329 1.3%
 
0.27 328 1.3%
 
0.21 325 1.3%
 
0.28 322 1.3%
 
0.34 320 1.3%
 
Other values (318) 17828 71.3%
 
(Missing) 3843 15.4%
 

Minimum 5 values

Value Count Frequency (%)  
-2.96 1 0.0%
 
-2.9 1 0.0%
 
-2.88 1 0.0%
 
-2.86 2 0.0%
 
-2.85 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
2.36 1 0.0%
 
2.38 1 0.0%
 
2.44 1 0.0%
 
2.45 2 0.0%
 
2.49 1 0.0%
 

5years_category_std
Numeric

Distinct count 26
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.093149
Minimum 0
Maximum 0.36
Zeros (%) 0.7%

Quantile statistics

Minimum 0
5-th percentile 0.02
Q1 0.05
Median 0.1
Q3 0.13
95-th percentile 0.16
Maximum 0.36
Range 0.36
Interquartile range 0.08

Descriptive statistics

Standard deviation 0.0496
Coef of variation 0.53248
Kurtosis 1.0779
Mean 0.093149
MAD 0.041488
Skewness 0.36764
Sum 2318.9
Variance 0.0024602
Memory size 195.4 KiB
Value Count Frequency (%)  
0.03 3226 12.9%
 
0.12 3134 12.5%
 
0.13 2586 10.3%
 
0.11 2248 9.0%
 
0.05 1766 7.1%
 
0.08 1648 6.6%
 
0.15 1444 5.8%
 
0.09 1298 5.2%
 
0.04 980 3.9%
 
0.01 960 3.8%
 
Other values (15) 5604 22.4%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 187 0.7%
 
0.01 960 3.8%
 
0.02 265 1.1%
 
0.03 3226 12.9%
 
0.04 980 3.9%
 

Maximum 5 values

Value Count Frequency (%)  
0.21 54 0.2%
 
0.22 53 0.2%
 
0.26 77 0.3%
 
0.28 15 0.1%
 
0.36 57 0.2%
 

5years_fund_r_squared
Numeric

Distinct count 6488
Unique (%) 26.0%
Missing (%) 15.4%
Missing (n) 3843
Infinite (%) 0.0%
Infinite (n) 0
Mean 72.453
Minimum 0
Maximum 100
Zeros (%) 0.1%

Quantile statistics

Minimum 0
5-th percentile 3.23
Q1 64.26
Median 82.36
Q3 92.52
95-th percentile 97.35
Maximum 100
Range 100
Interquartile range 28.26

Descriptive statistics

Standard deviation 27.494
Coef of variation 0.37948
Kurtosis 0.96791
Mean 72.453
MAD 21.093
Skewness -1.4073
Sum 1532900
Variance 755.95
Memory size 195.4 KiB
Value Count Frequency (%)  
100.0 59 0.2%
 
99.99 39 0.2%
 
96.82 28 0.1%
 
95.38 26 0.1%
 
95.44 23 0.1%
 
95.43 21 0.1%
 
95.4 20 0.1%
 
95.16 20 0.1%
 
95.06 19 0.1%
 
0.0 19 0.1%
 
Other values (6477) 20883 83.5%
 
(Missing) 3843 15.4%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 19 0.1%
 
0.01 16 0.1%
 
0.02 18 0.1%
 
0.03 6 0.0%
 
0.04 14 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
99.95 3 0.0%
 
99.97 5 0.0%
 
99.98 15 0.1%
 
99.99 39 0.2%
 
100.0 59 0.2%
 

5years_fund_std
Numeric

Distinct count 2179
Unique (%) 8.7%
Missing (%) 15.4%
Missing (n) 3843
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.4574
Minimum 0.17
Maximum 56.67
Zeros (%) 0.0%

Quantile statistics

Minimum 0.17
5-th percentile 2.068
Q1 4.67
Median 10.34
Q3 12.83
95-th percentile 16.77
Maximum 56.67
Range 56.5
Interquartile range 8.16

Descriptive statistics

Standard deviation 5.3224
Coef of variation 0.56278
Kurtosis 2.7525
Mean 9.4574
MAD 4.3377
Skewness 0.71357
Sum 200090
Variance 28.328
Memory size 195.4 KiB
Value Count Frequency (%)  
2.74 71 0.3%
 
11.18 60 0.2%
 
2.84 57 0.2%
 
2.66 54 0.2%
 
2.83 54 0.2%
 
2.72 52 0.2%
 
2.76 50 0.2%
 
11.5 48 0.2%
 
2.8 47 0.2%
 
2.86 47 0.2%
 
Other values (2168) 20617 82.5%
 
(Missing) 3843 15.4%
 

Minimum 5 values

Value Count Frequency (%)  
0.17 1 0.0%
 
0.2 1 0.0%
 
0.25 1 0.0%
 
0.26 3 0.0%
 
0.27 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
50.79 1 0.0%
 
52.97 1 0.0%
 
53.07 1 0.0%
 
56.61 1 0.0%
 
56.67 1 0.0%
 

5yrs_sharpe_ratio_category
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.0059733
Minimum -0.01
Maximum 0.01
Zeros (%) 38.8%

Quantile statistics

Minimum -0.01
5-th percentile 0
Q1 0
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.01
Range 0.02
Interquartile range 0.01

Descriptive statistics

Standard deviation 0.0050346
Coef of variation 0.84285
Kurtosis -1.3252
Mean 0.0059733
MAD 0.0048626
Skewness -0.54861
Sum 148.7
Variance 2.5347e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 15031 60.1%
 
-0.0 9702 38.8%
 
-0.01 161 0.6%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.01 161 0.6%
 
-0.0 9702 38.8%
 
0.01 15031 60.1%
 

Maximum 5 values

Value Count Frequency (%)  
-0.01 161 0.6%
 
-0.0 9702 38.8%
 
0.01 15031 60.1%
 

5yrs_sharpe_ratio_fund
Numeric

Distinct count 345
Unique (%) 1.4%
Missing (%) 15.4%
Missing (n) 3843
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.51784
Minimum -5.14
Maximum 3.22
Zeros (%) 0.3%

Quantile statistics

Minimum -5.14
5-th percentile -0.08
Q1 0.33
Median 0.55
Q3 0.73
95-th percentile 1.02
Maximum 3.22
Range 8.36
Interquartile range 0.4

Descriptive statistics

Standard deviation 0.36255
Coef of variation 0.70012
Kurtosis 7.0212
Mean 0.51784
MAD 0.26252
Skewness -0.81959
Sum 10956
Variance 0.13144
Memory size 195.4 KiB
Value Count Frequency (%)  
0.6 382 1.5%
 
0.56 373 1.5%
 
0.58 364 1.5%
 
0.52 348 1.4%
 
0.66 341 1.4%
 
0.62 340 1.4%
 
0.54 338 1.4%
 
0.57 327 1.3%
 
0.7 316 1.3%
 
0.64 311 1.2%
 
Other values (334) 17717 70.9%
 
(Missing) 3843 15.4%
 

Minimum 5 values

Value Count Frequency (%)  
-5.14 1 0.0%
 
-2.64 1 0.0%
 
-2.56 1 0.0%
 
-2.44 1 0.0%
 
-1.84 2 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
2.66 1 0.0%
 
2.9 1 0.0%
 
2.98 1 0.0%
 
3.02 1 0.0%
 
3.22 2 0.0%
 

5yrs_treynor_ratio_category
Numeric

Distinct count 25
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.040969
Minimum -0.16
Maximum 0.32
Zeros (%) 4.5%

Quantile statistics

Minimum -0.16
5-th percentile -0.01
Q1 0.02
Median 0.04
Q3 0.07
95-th percentile 0.1
Maximum 0.32
Range 0.48
Interquartile range 0.05

Descriptive statistics

Standard deviation 0.044379
Coef of variation 1.0832
Kurtosis 7.1312
Mean 0.040969
MAD 0.0288
Skewness -0.46642
Sum 1019.9
Variance 0.0019695
Memory size 195.4 KiB
Value Count Frequency (%)  
0.04 4525 18.1%
 
0.02 3590 14.4%
 
0.03 3051 12.2%
 
0.07 2270 9.1%
 
0.05 2125 8.5%
 
0.06 1644 6.6%
 
0.1 1333 5.3%
 
0.01 1318 5.3%
 
0.08 1273 5.1%
 
-0.0 1125 4.5%
 
Other values (14) 2640 10.6%
 

Minimum 5 values

Value Count Frequency (%)  
-0.16 2 0.0%
 
-0.13 410 1.6%
 
-0.1 230 0.9%
 
-0.09 142 0.6%
 
-0.08 77 0.3%
 

Maximum 5 values

Value Count Frequency (%)  
0.1 1333 5.3%
 
0.11 677 2.7%
 
0.12 454 1.8%
 
0.25 50 0.2%
 
0.32 51 0.2%
 

5yrs_treynor_ratio_fund
Categorical

Distinct count 2834
Unique (%) 11.3%
Missing (%) 15.4%
Missing (n) 3843
3.56
 
46
3.8
 
45
3.84
 
45
Other values (2830)
21021
(Missing)
 
3843
Value Count Frequency (%)  
3.56 46 0.2%
 
3.8 45 0.2%
 
3.84 45 0.2%
 
3.64 43 0.2%
 
2.8 41 0.2%
 
3.32 41 0.2%
 
4.18 41 0.2%
 
3.86 40 0.2%
 
2.92 40 0.2%
 
2.48 39 0.2%
 
Other values (2823) 20736 82.9%
 
(Missing) 3843 15.4%
 

category_r_squared_5years
Numeric

Distinct count 60
Unique (%) 0.2%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.71275
Minimum 0
Maximum 0.97
Zeros (%) 0.9%

Quantile statistics

Minimum 0
5-th percentile 0.04
Q1 0.64
Median 0.83
Q3 0.89
95-th percentile 0.95
Maximum 0.97
Range 0.97
Interquartile range 0.25

Descriptive statistics

Standard deviation 0.26011
Coef of variation 0.36494
Kurtosis 1.0817
Mean 0.71275
MAD 0.20099
Skewness -1.4307
Sum 17743
Variance 0.067658
Memory size 195.4 KiB
Value Count Frequency (%)  
0.86 2480 9.9%
 
0.89 1684 6.7%
 
0.78 1431 5.7%
 
0.84 1423 5.7%
 
0.93 1413 5.7%
 
0.65 1353 5.4%
 
0.94 1027 4.1%
 
0.96 957 3.8%
 
0.95 898 3.6%
 
0.52 799 3.2%
 
Other values (49) 11429 45.7%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 215 0.9%
 
0.01 230 0.9%
 
0.03 715 2.9%
 
0.04 144 0.6%
 
0.06 57 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
0.93 1413 5.7%
 
0.94 1027 4.1%
 
0.95 898 3.6%
 
0.96 957 3.8%
 
0.97 247 1.0%
 

tag
Numeric

Distinct count 25000
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 139880
Minimum 26000
Maximum 253763
Zeros (%) 0.0%

Quantile statistics

Minimum 26000
5-th percentile 37367
Q1 83022
Median 139880
Q3 196760
95-th percentile 242390
Maximum 253763
Range 227763
Interquartile range 113740

Descriptive statistics

Standard deviation 65731
Coef of variation 0.46992
Kurtosis -1.199
Mean 139880
MAD 56921
Skewness 6.0424e-05
Sum 3496973366
Variance 4320600000
Memory size 195.4 KiB
Value Count Frequency (%)  
165887 1 0.0%
 
193211 1 0.0%
 
86687 1 0.0%
 
174752 1 0.0%
 
41633 1 0.0%
 
144035 1 0.0%
 
232100 1 0.0%
 
98981 1 0.0%
 
39590 1 0.0%
 
201383 1 0.0%
 
Other values (24990) 24990 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
26000 1 0.0%
 
26009 1 0.0%
 
26018 1 0.0%
 
26027 1 0.0%
 
26036 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
253727 1 0.0%
 
253736 1 0.0%
 
253745 1 0.0%
 
253754 1 0.0%
 
253763 1 0.0%
 

Correlations

Sample

category_r_squared_5years 5yrs_sharpe_ratio_fund 5_years_alpha_fund 5years_fund_r_squared 5years_fund_std 5yrs_sharpe_ratio_category 5_years_beta_fund 5yrs_treynor_ratio_fund 5_years_return_mean_annual_fund 5_years_return_mean_annual_category 5yrs_treynor_ratio_category 5_years_return_fund 5_years_alpha_category 5_years_beta_category 5years_category_std tag 5_years_return_category
0 0.51 NaN NaN NaN NaN -0.00 NaN NaN NaN -0.00 -0.04 0.00 -0.11 0.01 0.20 67922 -4.25
1 0.86 0.89 0.34 90.11 12.40 0.01 1.05 10.37 0.99 0.01 0.10 11.71 -0.00 0.01 0.13 134783 11.26
2 0.86 1.15 2.96 89.02 10.28 0.01 0.86 13.84 1.05 0.01 0.10 12.78 -0.00 0.01 0.13 61271 11.26
3 0.86 0.77 -0.50 82.36 8.53 0.01 0.69 9.3 0.62 0.01 0.07 7.25 -0.03 0.01 0.11 64412 7.20
4 0.96 NaN NaN NaN NaN 0.01 NaN NaN NaN 0.01 0.04 0.00 -0.01 0.01 0.09 184058 5.95
In [15]:
#return_10years contains 17 columns which give information about 10 year return and ratios
return_10year = pd.read_csv('Hackathon_Files/external/return_10year.csv')
pandas_profiling.ProfileReport(return_10year)
Out[15]:

Overview

Dataset info

Number of variables 17
Number of observations 25000
Total Missing (%) 12.3%
Total size in memory 3.2 MiB
Average record size in memory 136.0 B

Variables types

Numeric 14
Categorical 1
Boolean 0
Date 0
Text (Unique) 1
Rejected 1
Unsupported 0

Warnings

Variables

10_years_alpha_category
Numeric

Distinct count 18
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.0011356
Minimum -0.11
Maximum 0.1
Zeros (%) 19.6%

Quantile statistics

Minimum -0.11
5-th percentile -0.03
Q1 -0.02
Median -0
Q3 0.01
95-th percentile 0.06
Maximum 0.1
Range 0.21
Interquartile range 0.03

Descriptive statistics

Standard deviation 0.027795
Coef of variation 24.476
Kurtosis 1.7129
Mean 0.0011356
MAD 0.02014
Skewness 0.91709
Sum 28.27
Variance 0.00077256
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.01 5346 21.4%
 
0.0 4900 19.6%
 
-0.02 4605 18.4%
 
0.01 2769 11.1%
 
0.04 1946 7.8%
 
-0.03 1503 6.0%
 
0.02 981 3.9%
 
0.08 664 2.7%
 
-0.04 546 2.2%
 
0.06 509 2.0%
 
Other values (7) 1125 4.5%
 

Minimum 5 values

Value Count Frequency (%)  
-0.11 77 0.3%
 
-0.06 150 0.6%
 
-0.05 25 0.1%
 
-0.04 546 2.2%
 
-0.03 1503 6.0%
 

Maximum 5 values

Value Count Frequency (%)  
0.05 193 0.8%
 
0.06 509 2.0%
 
0.07 428 1.7%
 
0.08 664 2.7%
 
0.1 50 0.2%
 

10_years_alpha_fund
Numeric

Distinct count 1810
Unique (%) 7.2%
Missing (%) 34.3%
Missing (n) 8584
Infinite (%) 0.0%
Infinite (n) 0
Mean -0.0031475
Minimum -25.97
Maximum 14.86
Zeros (%) 0.1%

Quantile statistics

Minimum -25.97
5-th percentile -4.3625
Q1 -1.74
Median -0.3
Q3 1.28
95-th percentile 6.66
Maximum 14.86
Range 40.83
Interquartile range 3.02

Descriptive statistics

Standard deviation 3.2756
Coef of variation -1040.7
Kurtosis 3.4222
Mean -0.0031475
MAD 2.299
Skewness 0.10365
Sum -51.67
Variance 10.729
Memory size 195.4 KiB
Value Count Frequency (%)  
-0.55 51 0.2%
 
-0.16 49 0.2%
 
-0.18 48 0.2%
 
-0.32 47 0.2%
 
-0.3 47 0.2%
 
-0.82 45 0.2%
 
-0.58 45 0.2%
 
-0.56 44 0.2%
 
-0.08 44 0.2%
 
-0.46 44 0.2%
 
Other values (1799) 15952 63.8%
 
(Missing) 8584 34.3%
 

Minimum 5 values

Value Count Frequency (%)  
-25.97 1 0.0%
 
-25.02 1 0.0%
 
-24.7 1 0.0%
 
-23.71 1 0.0%
 
-22.16 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
13.48 1 0.0%
 
13.74 1 0.0%
 
13.89 1 0.0%
 
14.52 1 0.0%
 
14.86 1 0.0%
 

10_years_beta_category
Numeric

Distinct count 7
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.0090757
Minimum -0.02
Maximum 0.12
Zeros (%) 12.5%

Quantile statistics

Minimum -0.02
5-th percentile 0
Q1 0.01
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.12
Range 0.14
Interquartile range 0

Descriptive statistics

Standard deviation 0.0054928
Coef of variation 0.60522
Kurtosis 187.34
Mean 0.0090757
MAD 0.0024053
Skewness 8.64
Sum 225.93
Variance 3.017e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 21030 84.1%
 
0.0 3129 12.5%
 
0.02 629 2.5%
 
-0.02 53 0.2%
 
0.12 28 0.1%
 
0.03 25 0.1%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.02 53 0.2%
 
0.0 3129 12.5%
 
0.01 21030 84.1%
 
0.02 629 2.5%
 
0.03 25 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
0.0 3129 12.5%
 
0.01 21030 84.1%
 
0.02 629 2.5%
 
0.03 25 0.1%
 
0.12 28 0.1%
 

10_years_beta_fund
Numeric

Distinct count 305
Unique (%) 1.2%
Missing (%) 34.3%
Missing (n) 8584
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.96322
Minimum -88.06
Maximum 49.29
Zeros (%) 0.1%

Quantile statistics

Minimum -88.06
5-th percentile 0.31
Q1 0.86
Median 1.01
Q3 1.13
95-th percentile 1.42
Maximum 49.29
Range 137.35
Interquartile range 0.27

Descriptive statistics

Standard deviation 1.5826
Coef of variation 1.6431
Kurtosis 1624.7
Mean 0.96322
MAD 0.28399
Skewness -19.289
Sum 15812
Variance 2.5047
Memory size 195.4 KiB
Value Count Frequency (%)  
1.0 567 2.3%
 
1.06 483 1.9%
 
1.01 439 1.8%
 
1.04 403 1.6%
 
1.02 383 1.5%
 
1.08 355 1.4%
 
0.96 345 1.4%
 
1.05 342 1.4%
 
0.99 326 1.3%
 
0.98 324 1.3%
 
Other values (294) 12449 49.8%
 
(Missing) 8584 34.3%
 

Minimum 5 values

Value Count Frequency (%)  
-88.06 1 0.0%
 
-87.88 1 0.0%
 
-48.87 1 0.0%
 
-48.76 1 0.0%
 
-48.75 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
38.24 2 0.0%
 
38.25 1 0.0%
 
49.18 1 0.0%
 
49.21 1 0.0%
 
49.29 1 0.0%
 

10_years_return_category
Numeric

Distinct count 100
Unique (%) 0.4%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.6793
Minimum -24.99
Maximum 18.72
Zeros (%) 1.5%

Quantile statistics

Minimum -24.99
5-th percentile 1.72
Q1 6.44
Median 9.97
Q3 14.12
95-th percentile 15.94
Maximum 18.72
Range 43.71
Interquartile range 7.68

Descriptive statistics

Standard deviation 4.9264
Coef of variation 0.50896
Kurtosis 4.0661
Mean 9.6793
MAD 3.9918
Skewness -0.9968
Sum 240870
Variance 24.27
Memory size 195.4 KiB
Value Count Frequency (%)  
15.94 1333 5.3%
 
14.54 1270 5.1%
 
13.68 1121 4.5%
 
4.56 957 3.8%
 
11.79 938 3.8%
 
8.92 757 3.0%
 
9.98 708 2.8%
 
14.67 683 2.7%
 
8.43 680 2.7%
 
16.24 669 2.7%
 
Other values (89) 15769 63.1%
 

Minimum 5 values

Value Count Frequency (%)  
-24.99 53 0.2%
 
-3.06 109 0.4%
 
-2.95 57 0.2%
 
-2.76 114 0.5%
 
0.0 387 1.5%
 

Maximum 5 values

Value Count Frequency (%)  
16.24 669 2.7%
 
17.07 101 0.4%
 
17.16 27 0.1%
 
17.24 225 0.9%
 
18.72 160 0.6%
 

10_years_return_fund
Numeric

Distinct count 2275
Unique (%) 9.1%
Missing (%) 0.5%
Missing (n) 115
Infinite (%) 0.0%
Infinite (n) 0
Mean 6.621
Minimum -38.56
Maximum 40.66
Zeros (%) 33.9%

Quantile statistics

Minimum -38.56
5-th percentile 0
Q1 0
Median 5.9
Q3 12.38
95-th percentile 16.66
Maximum 40.66
Range 79.22
Interquartile range 12.38

Descriptive statistics

Standard deviation 6.5374
Coef of variation 0.98738
Kurtosis 0.78071
Mean 6.621
MAD 5.7164
Skewness -0.018269
Sum 164760
Variance 42.738
Memory size 195.4 KiB
Value Count Frequency (%)  
0.0 8475 33.9%
 
15.14 24 0.1%
 
15.42 24 0.1%
 
9.24 23 0.1%
 
13.11 21 0.1%
 
13.26 21 0.1%
 
13.63 20 0.1%
 
14.05 20 0.1%
 
12.41 20 0.1%
 
10.15 20 0.1%
 
Other values (2264) 16217 64.9%
 
(Missing) 115 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
-38.56 1 0.0%
 
-38.21 1 0.0%
 
-37.94 1 0.0%
 
-37.78 1 0.0%
 
-37.77 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
36.45 1 0.0%
 
36.86 1 0.0%
 
37.81 1 0.0%
 
37.92 2 0.0%
 
40.66 1 0.0%
 

10_years_return_mean_annual_category
Numeric

Distinct count 5
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.007681
Minimum -0.02
Maximum 0.02
Zeros (%) 23.1%

Quantile statistics

Minimum -0.02
5-th percentile 0
Q1 0.01
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.02
Range 0.04
Interquartile range 0

Descriptive statistics

Standard deviation 0.004514
Coef of variation 0.58769
Kurtosis 2.3657
Mean 0.007681
MAD 0.003681
Skewness -1.4002
Sum 191.21
Variance 2.0376e-05
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 18907 75.6%
 
0.0 5774 23.1%
 
0.02 160 0.6%
 
-0.02 53 0.2%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.02 53 0.2%
 
0.0 5774 23.1%
 
0.01 18907 75.6%
 
0.02 160 0.6%
 

Maximum 5 values

Value Count Frequency (%)  
-0.02 53 0.2%
 
0.0 5774 23.1%
 
0.01 18907 75.6%
 
0.02 160 0.6%
 

10_years_return_mean_annual_fund
Highly correlated

This variable is highly correlated with 10_years_return_fund and should be ignored for analysis

Correlation 0.99243

10years_category_r_squared
Numeric

Distinct count 53
Unique (%) 0.2%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.73158
Minimum 0
Maximum 0.97
Zeros (%) 2.5%

Quantile statistics

Minimum 0
5-th percentile 0.03
Q1 0.71
Median 0.84
Q3 0.92
95-th percentile 0.95
Maximum 0.97
Range 0.97
Interquartile range 0.21

Descriptive statistics

Standard deviation 0.27368
Coef of variation 0.3741
Kurtosis 1.3046
Mean 0.73158
MAD 0.20389
Skewness -1.5742
Sum 18212
Variance 0.074903
Memory size 195.4 KiB
Value Count Frequency (%)  
0.94 2320 9.3%
 
0.88 1485 5.9%
 
0.82 1448 5.8%
 
0.91 1333 5.3%
 
0.73 1314 5.3%
 
0.75 1277 5.1%
 
0.9 1173 4.7%
 
0.92 1105 4.4%
 
0.93 1037 4.1%
 
0.84 985 3.9%
 
Other values (42) 11417 45.7%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 616 2.5%
 
0.01 193 0.8%
 
0.03 664 2.7%
 
0.06 78 0.3%
 
0.07 302 1.2%
 

Maximum 5 values

Value Count Frequency (%)  
0.93 1037 4.1%
 
0.94 2320 9.3%
 
0.95 938 3.8%
 
0.96 183 0.7%
 
0.97 875 3.5%
 

10years_category_std
Numeric

Distinct count 27
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.1094
Minimum 0
Maximum 0.34
Zeros (%) 1.5%

Quantile statistics

Minimum 0
5-th percentile 0.02
Q1 0.06
Median 0.12
Q3 0.15
95-th percentile 0.18
Maximum 0.34
Range 0.34
Interquartile range 0.09

Descriptive statistics

Standard deviation 0.055202
Coef of variation 0.5046
Kurtosis -0.38714
Mean 0.1094
MAD 0.047445
Skewness -0.017667
Sum 2723.4
Variance 0.0030473
Memory size 195.4 KiB
Value Count Frequency (%)  
0.13 2980 11.9%
 
0.16 2473 9.9%
 
0.14 2237 8.9%
 
0.07 1738 7.0%
 
0.04 1724 6.9%
 
0.18 1623 6.5%
 
0.03 1338 5.4%
 
0.1 1281 5.1%
 
0.05 1279 5.1%
 
0.09 1074 4.3%
 
Other values (16) 7147 28.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 386 1.5%
 
0.01 449 1.8%
 
0.02 550 2.2%
 
0.03 1338 5.4%
 
0.04 1724 6.9%
 

Maximum 5 values

Value Count Frequency (%)  
0.22 2 0.0%
 
0.24 53 0.2%
 
0.25 15 0.1%
 
0.26 92 0.4%
 
0.34 57 0.2%
 

10years_fund_r_squared
Numeric

Distinct count 5185
Unique (%) 20.7%
Missing (%) 34.3%
Missing (n) 8584
Infinite (%) 0.0%
Infinite (n) 0
Mean 76.603
Minimum 0
Maximum 100
Zeros (%) 0.1%

Quantile statistics

Minimum 0
5-th percentile 5.8875
Q1 72.52
Median 86.08
Q3 93.77
95-th percentile 97.51
Maximum 100
Range 100
Interquartile range 21.25

Descriptive statistics

Standard deviation 25.699
Coef of variation 0.33549
Kurtosis 2.1707
Mean 76.603
MAD 18.66
Skewness -1.7549
Sum 1257500
Variance 660.45
Memory size 195.4 KiB
Value Count Frequency (%)  
99.99 58 0.2%
 
100.0 43 0.2%
 
0.0 31 0.1%
 
0.02 28 0.1%
 
96.98 26 0.1%
 
0.01 21 0.1%
 
93.74 20 0.1%
 
96.13 20 0.1%
 
95.31 20 0.1%
 
95.68 19 0.1%
 
Other values (5174) 16130 64.5%
 
(Missing) 8584 34.3%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 31 0.1%
 
0.01 21 0.1%
 
0.02 28 0.1%
 
0.03 6 0.0%
 
0.04 7 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
99.92 3 0.0%
 
99.97 5 0.0%
 
99.98 8 0.0%
 
99.99 58 0.2%
 
100.0 43 0.2%
 

10years_fund_std
Numeric

Distinct count 2255
Unique (%) 9.0%
Missing (%) 34.3%
Missing (n) 8584
Infinite (%) 0.0%
Infinite (n) 0
Mean 11.419
Minimum 0.2
Maximum 52.29
Zeros (%) 0.0%

Quantile statistics

Minimum 0.2
5-th percentile 2.46
Q1 6.14
Median 12.74
Q3 15.62
95-th percentile 19.1
Maximum 52.29
Range 52.09
Interquartile range 9.48

Descriptive statistics

Standard deviation 5.9371
Coef of variation 0.51995
Kurtosis 1.0757
Mean 11.419
MAD 4.9435
Skewness 0.31377
Sum 187450
Variance 35.25
Memory size 195.4 KiB
Value Count Frequency (%)  
12.69 42 0.2%
 
12.68 41 0.2%
 
3.36 39 0.2%
 
3.18 35 0.1%
 
15.22 34 0.1%
 
12.7 33 0.1%
 
14.07 31 0.1%
 
13.02 30 0.1%
 
13.87 29 0.1%
 
14.08 29 0.1%
 
Other values (2244) 16073 64.3%
 
(Missing) 8584 34.3%
 

Minimum 5 values

Value Count Frequency (%)  
0.2 1 0.0%
 
0.22 2 0.0%
 
0.25 1 0.0%
 
0.27 2 0.0%
 
0.3 2 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
50.17 1 0.0%
 
51.58 1 0.0%
 
51.63 1 0.0%
 
52.18 1 0.0%
 
52.29 1 0.0%
 

10yrs_sharpe_ratio_category
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.0095107
Minimum -0.01
Maximum 0.01
Zeros (%) 4.4%

Quantile statistics

Minimum -0.01
5-th percentile 0.01
Q1 0.01
Median 0.01
Q3 0.01
95-th percentile 0.01
Maximum 0.01
Range 0.02
Interquartile range 0

Descriptive statistics

Standard deviation 0.0022537
Coef of variation 0.23697
Kurtosis 23.133
Mean 0.0095107
MAD 0.00093275
Skewness -4.729
Sum 236.76
Variance 5.0794e-06
Memory size 195.4 KiB
Value Count Frequency (%)  
0.01 23729 94.9%
 
0.0 1112 4.4%
 
-0.01 53 0.2%
 
(Missing) 106 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
-0.01 53 0.2%
 
0.0 1112 4.4%
 
0.01 23729 94.9%
 

Maximum 5 values

Value Count Frequency (%)  
-0.01 53 0.2%
 
0.0 1112 4.4%
 
0.01 23729 94.9%
 

10yrs_sharpe_ratio_fund
Numeric

Distinct count 322
Unique (%) 1.3%
Missing (%) 34.3%
Missing (n) 8584
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.93749
Minimum -6.58
Maximum 3.01
Zeros (%) 0.0%

Quantile statistics

Minimum -6.58
5-th percentile 0.43
Q1 0.8
Median 0.96
Q3 1.12
95-th percentile 1.41
Maximum 3.01
Range 9.59
Interquartile range 0.32

Descriptive statistics

Standard deviation 0.34227
Coef of variation 0.36509
Kurtosis 23.314
Mean 0.93749
MAD 0.22893
Skewness -2.1068
Sum 15390
Variance 0.11715
Memory size 195.4 KiB
Value Count Frequency (%)  
0.98 372 1.5%
 
0.96 365 1.5%
 
0.92 360 1.4%
 
0.94 350 1.4%
 
0.97 307 1.2%
 
0.9 307 1.2%
 
0.91 300 1.2%
 
1.04 300 1.2%
 
1.0 299 1.2%
 
0.88 294 1.2%
 
Other values (311) 13162 52.6%
 
(Missing) 8584 34.3%
 

Minimum 5 values

Value Count Frequency (%)  
-6.58 1 0.0%
 
-2.01 1 0.0%
 
-1.88 1 0.0%
 
-1.85 1 0.0%
 
-1.76 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
2.44 1 0.0%
 
2.46 1 0.0%
 
2.78 1 0.0%
 
2.89 1 0.0%
 
3.01 1 0.0%
 

10yrs_treynor_ratio_category
Numeric

Distinct count 30
Unique (%) 0.1%
Missing (%) 0.4%
Missing (n) 106
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.13884
Minimum -0.19
Maximum 4.68
Zeros (%) 1.7%

Quantile statistics

Minimum -0.19
5-th percentile 0
Q1 0.07
Median 0.1
Q3 0.14
95-th percentile 0.21
Maximum 4.68
Range 4.87
Interquartile range 0.07

Descriptive statistics

Standard deviation 0.44255
Coef of variation 3.1875
Kurtosis 99.465
Mean 0.13884
MAD 0.09513
Skewness 9.9758
Sum 3456.2
Variance 0.19585
Memory size 195.4 KiB
Value Count Frequency (%)  
0.14 4428 17.7%
 
0.08 3807 15.2%
 
0.04 2092 8.4%
 
0.09 1643 6.6%
 
0.15 1445 5.8%
 
0.1 1368 5.5%
 
0.13 1349 5.4%
 
0.05 1341 5.4%
 
0.12 1247 5.0%
 
0.11 967 3.9%
 
Other values (19) 5207 20.8%
 

Minimum 5 values

Value Count Frequency (%)  
-0.19 51 0.2%
 
-0.14 193 0.8%
 
-0.1 142 0.6%
 
-0.05 278 1.1%
 
-0.02 323 1.3%
 

Maximum 5 values

Value Count Frequency (%)  
0.19 226 0.9%
 
0.21 372 1.5%
 
0.23 677 2.7%
 
0.3 50 0.2%
 
4.68 230 0.9%
 

10yrs_treynor_ratio_fund
Categorical

Distinct count 2753
Unique (%) 11.0%
Missing (%) 34.3%
Missing (n) 8584
7.7
 
30
14.92
 
26
7.42
 
23
Other values (2749)
16337
(Missing)
8584
Value Count Frequency (%)  
7.7 30 0.1%
 
14.92 26 0.1%
 
7.42 23 0.1%
 
7.6 22 0.1%
 
14.24 22 0.1%
 
13.42 22 0.1%
 
8.48 21 0.1%
 
12.02 21 0.1%
 
14.52 21 0.1%
 
12.37 21 0.1%
 
Other values (2742) 16187 64.7%
 
(Missing) 8584 34.3%
 

fund_id
Categorical, Unique

First 3 values
e7dff334-3313-4348-917a-64c631da08f1
abf7f06e-6d96-4016-a9c8-2c7975ecf778
0edb76db-aca6-4b0f-8e4e-772674e188fa
Last 3 values
5c653690-cbea-4370-908e-582b0c74cc2d
c97e052e-0f2d-42bb-bacd-f58e116d4c85
819f40d9-f07d-480d-9be8-045999bbb7f5

First 10 values

Value Count Frequency (%)  
0002e898-709a-4b80-8f5c-ec846feff26c 1 0.0%
 
00070160-01a2-4ad3-9290-958a110c8e9f 1 0.0%
 
0009d9da-6735-46c1-81cd-dbc62c53c2e2 1 0.0%
 
000ad9cc-3f7e-48f3-a1f1-4f5c03d3eb6d 1 0.0%
 
000b6091-3c16-41a1-9df4-fce73767dd21 1 0.0%
 

Last 10 values

Value Count Frequency (%)  
fff6de73-cbbd-4814-a59a-f0210d669eae 1 0.0%
 
fff75f2a-1419-4d65-a68f-89d601d47350 1 0.0%
 
fff79179-2ca5-4f26-a023-929c255aeda4 1 0.0%
 
fffb0e0f-2dc9-4e86-b534-476f9669720b 1 0.0%
 
fffe9b65-2288-4d99-844e-89e7747aa323 1 0.0%
 

Correlations

Sample

10years_category_r_squared 10yrs_sharpe_ratio_fund 10_years_alpha_fund 10years_fund_r_squared 10years_fund_std 10yrs_sharpe_ratio_category 10_years_beta_fund 10yrs_treynor_ratio_fund fund_id 10_years_return_mean_annual_category 10yrs_treynor_ratio_category 10_years_return_fund 10_years_alpha_category 10_years_beta_category 10years_category_std 10_years_return_mean_annual_fund 10_years_return_category
0 0.49 NaN NaN NaN NaN 0.01 NaN NaN 264614c6-5ac3-4146-ba26-1674b136cb40 0.01 0.21 0.00 0.06 0.01 0.13 NaN 14.30
1 0.88 1.16 0.16 91.68 14.30 0.01 1.08 15.57 f5ad58c2-fdea-4087-8678-e04744f89f90 0.01 0.15 17.25 -0.01 0.01 0.14 1.42 15.94
2 0.88 1.22 1.00 90.69 12.68 0.01 0.95 16.58 3c13f4ab-02c4-4ca7-a133-7e996ec5d0c4 0.01 0.15 16.21 -0.01 0.01 0.14 1.33 15.94
3 0.90 1.20 0.75 89.03 11.21 0.01 0.84 16.38 ff78bdd8-59eb-4cef-9f3c-b1baacce9554 0.01 0.14 14.12 -0.02 0.01 0.13 1.16 13.68
4 0.97 NaN NaN NaN NaN 0.01 NaN NaN 63d8406d-c525-494a-8e03-d4fc4cfcb571 0.01 0.08 0.00 -0.02 0.01 0.12 NaN 11.53

Attribute Information:

  • default 1 - yes 0 - no

  • account_check_status: (qualitative)

         Status of existing checking account
             A11 :      ... <    0 DM  (DM - Deutsch Mark)
         A12 : 0 <= ... <  200 DM
         A13 :      ... >= 200 DM /
           salary assignments for at least 1 year
             A14 : no checking account
  • duration_in_month: (numerical)

        Duration in month
  • credit_history: (qualitative)

        Credit history
        A30 : no credits taken/
          all credits paid back duly
            A31 : all credits at this bank paid back duly
        A32 : existing credits paid back duly till now
            A33 : delay in paying off in the past
        A34 : critical account/
          other credits existing (not at this bank)
  • purpose: (qualitative)

        Purpose
        A40 : car (new)
        A41 : car (used)
        A42 : furniture/equipment
        A43 : radio/television
        A44 : domestic appliances
        A45 : repairs
        A46 : education
        A47 : (vacation - does not exist?)
        A48 : retraining
        A49 : business
        A410 : others
  • credit_amount: (numerical)

        Credit amount
  • savings: (qualitative)

        Savings account/bonds
        A61 :          ... <  100 DM
        A62 :   100 <= ... <  500 DM
        A63 :   500 <= ... < 1000 DM
        A64 :          .. >= 1000 DM
            A65 :   unknown/ no savings account
  • present_emp_since: (qualitative)

        Present employment since
        A71 : unemployed
        A72 :       ... < 1 year
        A73 : 1  <= ... < 4 years  
        A74 : 4  <= ... < 7 years
        A75 :       .. >= 7 years
  • installment_as_income_perc: (numerical)

        Installment rate in percentage of disposable income
  • personal_status_sex: (qualitative)

        Personal status and sex
        A91 : male   : divorced/separated
        A92 : female : divorced/separated/married
            A93 : male   : single
        A94 : male   : married/widowed
        A95 : female : single
  • present_res_since: (numerical)

        Present residence since         
  • property: (qualitative)

        Property
        A121 : real estate
        A122 : if not A121 : building society savings agreement/
                 life insurance
            A123 : if not A121/A122 : car or other, not in attribute 6
        A124 : unknown / no property
  • age: (numerical)

        Age in years
  • other_installment_plans: (qualitative)

        Other installment plans 
        A141 : bank
        A142 : stores
        A143 : none
  • housing: (qualitative)

        Housing
        A151 : rent
        A152 : own
        A153 : for free
  • credits_this_bank : (numerical)

            Number of existing credits at this bank
  • job : (qualitative)

        Job
        A171 : unemployed/ unskilled  - non-resident
        A172 : unskilled - resident
        A173 : skilled employee / official
        A174 : management/ self-employed/
           highly qualified employee/ officer
  • people_under_maintenance: (numerical)

        Number of people being liable to provide maintenance for
  • telephone: (qualitative)

        Telephone
        A191 : none
        A192 : yes, registered under the customers name
  • foreign_worker: (qualitative)

        foreign worker
        A201 : yes
        A202 : no
In [3]:
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

print("Size of dataframe is " +color.BOLD+ format(origCreditDf.size) + color.END)
print("Shape(#rows,#columns) of dataframe is "+color.BOLD+ format(origCreditDf.shape) + color.END)
print("Dataframe information \n")
print(origCreditDf.info())
Size of dataframe is 21000
Shape(#rows,#columns) of dataframe is (1000, 21)
Dataframe information 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
default                       1000 non-null int64
account_check_status          1000 non-null object
duration_in_month             1000 non-null int64
credit_history                1000 non-null object
purpose                       1000 non-null object
credit_amount                 1000 non-null int64
savings                       1000 non-null object
present_emp_since             1000 non-null object
installment_as_income_perc    1000 non-null int64
personal_status_sex           1000 non-null object
other_debtors                 1000 non-null object
present_res_since             1000 non-null int64
property                      1000 non-null object
age                           1000 non-null int64
other_installment_plans       1000 non-null object
housing                       1000 non-null object
credits_this_bank             1000 non-null int64
job                           1000 non-null object
people_under_maintenance      1000 non-null int64
telephone                     1000 non-null object
foreign_worker                1000 non-null object
dtypes: int64(8), object(13)
memory usage: 164.1+ KB
None
In [4]:
#checking for missing values
origCreditDf.isnull().sum()
Out[4]:
default                       0
account_check_status          0
duration_in_month             0
credit_history                0
purpose                       0
credit_amount                 0
savings                       0
present_emp_since             0
installment_as_income_perc    0
personal_status_sex           0
other_debtors                 0
present_res_since             0
property                      0
age                           0
other_installment_plans       0
housing                       0
credits_this_bank             0
job                           0
people_under_maintenance      0
telephone                     0
foreign_worker                0
dtype: int64
In [5]:
## Dataset has no missing values. 5 point summary of numerical attributes
origCreditDf.describe().transpose()
Out[5]:
count mean std min 25% 50% 75% max
default 1000.0 0.300 0.458487 0.0 0.0 0.0 1.00 1.0
duration_in_month 1000.0 20.903 12.058814 4.0 12.0 18.0 24.00 72.0
credit_amount 1000.0 3271.258 2822.736876 250.0 1365.5 2319.5 3972.25 18424.0
installment_as_income_perc 1000.0 2.973 1.118715 1.0 2.0 3.0 4.00 4.0
present_res_since 1000.0 2.845 1.103718 1.0 2.0 3.0 4.00 4.0
age 1000.0 35.546 11.375469 19.0 27.0 33.0 42.00 75.0
credits_this_bank 1000.0 1.407 0.577654 1.0 1.0 1.0 2.00 4.0
people_under_maintenance 1000.0 1.155 0.362086 1.0 1.0 1.0 1.00 2.0
In [6]:
obj_origCreditDf=origCreditDf.select_dtypes(include=['object']).copy()
obj_origCreditDf.head(5)
print('defaulters :',origCreditDf['default'].unique())
# Number of 'good' credits (should be 700) and 'bad credits (should be 300)
origCreditDf['default'].value_counts()
defaulters : [0 1]
Out[6]:
0    700
1    300
Name: default, dtype: int64
In [7]:
print("Shape(#rows,#columns) of dataframe is "+color.BOLD+ format(obj_origCreditDf.shape) + color.END)
print(obj_origCreditDf.info())
print(obj_origCreditDf.columns)
Shape(#rows,#columns) of dataframe is (1000, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
account_check_status       1000 non-null object
credit_history             1000 non-null object
purpose                    1000 non-null object
savings                    1000 non-null object
present_emp_since          1000 non-null object
personal_status_sex        1000 non-null object
other_debtors              1000 non-null object
property                   1000 non-null object
other_installment_plans    1000 non-null object
housing                    1000 non-null object
job                        1000 non-null object
telephone                  1000 non-null object
foreign_worker             1000 non-null object
dtypes: object(13)
memory usage: 101.6+ KB
None
Index(['account_check_status', 'credit_history', 'purpose', 'savings',
       'present_emp_since', 'personal_status_sex', 'other_debtors', 'property',
       'other_installment_plans', 'housing', 'job', 'telephone',
       'foreign_worker'],
      dtype='object')
In [8]:
#Let's see possible values of  categrical variables in data

print('account_check_status        :',obj_origCreditDf['account_check_status'].unique())
print('credit_history :',obj_origCreditDf['credit_history'].unique())
print('purpose :',obj_origCreditDf['purpose'].unique())
print('savings :',obj_origCreditDf['savings'].unique())
print('present_emp_since :',obj_origCreditDf['present_emp_since'].unique())
print('personal_status_sex :',obj_origCreditDf['personal_status_sex'].unique())
print('other_debtors :',obj_origCreditDf['other_debtors'].unique())
print('property :',obj_origCreditDf['property'].unique())
print('other_installment_plans :',obj_origCreditDf['other_installment_plans'].unique())
print('housing :',obj_origCreditDf['housing'].unique())
print('job :',obj_origCreditDf['job'].unique())
print('telephone :',obj_origCreditDf['telephone'].unique())
print('foreign_worker :',obj_origCreditDf['foreign_worker'].unique())
account_check_status        : ['< 0 DM' '0 <= ... < 200 DM' 'no checking account'
 '>= 200 DM / salary assignments for at least 1 year']
credit_history : ['critical account/ other credits existing (not at this bank)'
 'existing credits paid back duly till now'
 'delay in paying off in the past'
 'no credits taken/ all credits paid back duly'
 'all credits at this bank paid back duly']
purpose : ['domestic appliances' '(vacation - does not exist?)' 'radio/television'
 'car (new)' 'car (used)' 'business' 'repairs' 'education'
 'furniture/equipment' 'retraining']
savings : ['unknown/ no savings account' '... < 100 DM' '500 <= ... < 1000 DM '
 '.. >= 1000 DM ' '100 <= ... < 500 DM']
present_emp_since : ['.. >= 7 years' '1 <= ... < 4 years' '4 <= ... < 7 years' 'unemployed'
 '... < 1 year ']
personal_status_sex : ['male : single' 'female : divorced/separated/married'
 'male : divorced/separated' 'male : married/widowed']
other_debtors : ['none' 'guarantor' 'co-applicant']
property : ['real estate'
 'if not A121 : building society savings agreement/ life insurance'
 'unknown / no property'
 'if not A121/A122 : car or other, not in attribute 6']
other_installment_plans : ['none' 'bank' 'stores']
housing : ['own' 'for free' 'rent']
job : ['skilled employee / official' 'unskilled - resident'
 'management/ self-employed/ highly qualified employee/ officer'
 'unemployed/ unskilled - non-resident']
telephone : ['yes, registered under the customers name ' 'none']
foreign_worker : ['yes' 'no']
In [9]:
#Lets check correlation among columns of dataframe.
from dython.nominal import associations
corr_df=associations(origCreditDf, nominal_columns=['default','account_check_status', 'credit_history', 'purpose', 'savings',
       'present_emp_since', 'personal_status_sex', 'other_debtors', 'property',
       'other_installment_plans', 'housing', 'job', 'telephone',
       'foreign_worker'], mark_columns=True, theil_u=True, plot=True, return_results=True)
In [10]:
corr_df
Out[10]:
default (nom) account_check_status (nom) duration_in_month (con) credit_history (nom) purpose (nom) credit_amount (con) savings (nom) present_emp_since (nom) installment_as_income_perc (con) personal_status_sex (nom) ... present_res_since (con) property (nom) age (con) other_installment_plans (nom) housing (nom) credits_this_bank (con) job (nom) people_under_maintenance (con) telephone (nom) foreign_worker (nom)
default (nom) 1.000000 0.107500 0.214927 0.049493 0.028247 0.154739 0.031902 0.014867 0.072404 0.007728 ... 0.002967 0.019273 0.091127 0.010071 0.014471 0.045732 0.001517 0.003015 0.001093 0.006607
account_check_status (nom) 0.052573 1.000000 0.118855 0.024831 0.027883 0.145556 0.037330 0.011421 0.074606 0.005207 ... 0.108725 0.006893 0.090730 0.001695 0.007726 0.097804 0.006513 0.076944 0.002639 0.002581
duration_in_month (con) 0.214927 0.118855 1.000000 0.194654 0.273692 0.624984 0.105586 0.093996 0.074749 0.133419 ... 0.034067 0.304274 -0.036136 0.077902 0.192174 -0.011284 0.218688 -0.023834 0.164718 0.138196
credit_history (nom) 0.025480 0.026139 0.194654 1.000000 0.039900 0.193283 0.009393 0.017312 0.072874 0.011302 ... 0.098787 0.007824 0.176836 0.030006 0.007789 0.595094 0.005593 0.097687 0.002144 0.003431
purpose (nom) 0.009335 0.018842 0.273692 0.025614 1.000000 0.370954 0.014289 0.015998 0.182953 0.018441 ... 0.151836 0.033483 0.171765 0.010225 0.022393 0.146968 0.029645 0.163750 0.013384 0.007179
credit_amount (con) 0.154739 0.145556 0.624984 0.193283 0.370954 1.000000 0.129507 0.111905 -0.271316 0.187014 ... 0.028926 0.318339 0.032716 0.048336 0.201812 0.020795 0.334607 0.017142 0.276995 0.050050
savings (nom) 0.016658 0.039858 0.105586 0.009528 0.022578 0.129507 1.000000 0.013570 0.046553 0.004986 ... 0.099015 0.007897 0.112603 0.000398 0.001975 0.074588 0.006799 0.033914 0.003656 0.000748
present_emp_since (nom) 0.006079 0.009549 0.093996 0.013751 0.019794 0.111905 0.010627 1.000000 0.140501 0.028665 ... 0.325431 0.020539 0.409607 0.003227 0.019023 0.154743 0.062587 0.097989 0.007580 0.003070
installment_as_income_perc (con) 0.072404 0.074606 0.074749 0.072874 0.182953 -0.271316 0.046553 0.140501 1.000000 0.143033 ... 0.049302 0.055589 0.058266 0.057177 0.094890 0.021669 0.111352 -0.071207 0.014413 0.090024
personal_status_sex (nom) 0.004445 0.006125 0.133419 0.012628 0.032098 0.187014 0.005493 0.040323 0.143033 1.000000 ... 0.113764 0.022221 0.245809 0.003382 0.040270 0.118680 0.009094 0.284250 0.003767 0.002179
other_debtors (nom) 0.008909 0.033111 0.048387 0.029159 0.061568 0.100164 0.036189 0.020880 0.014840 0.006117 ... 0.028335 0.056189 0.030888 0.008428 0.011689 0.025712 0.022499 0.048008 0.008195 0.013679
present_res_since (con) 0.002967 0.108725 0.034067 0.098787 0.151836 0.028926 0.099015 0.325431 0.049302 0.113764 ... 1.000000 0.191575 0.266419 0.055319 0.307190 0.089625 0.035411 0.042643 0.095359 0.054097
property (nom) 0.008720 0.006377 0.304274 0.006876 0.045841 0.318339 0.006843 0.022727 0.055589 0.017479 ... 0.191575 1.000000 0.224743 0.004725 0.166006 0.018524 0.041165 0.094770 0.014674 0.008211
age (con) 0.091127 0.090730 -0.036136 0.176836 0.171765 0.032716 0.112603 0.409607 0.058266 0.245809 ... 0.266419 0.224743 1.000000 0.047069 0.307002 0.149254 0.164476 0.118201 0.145259 0.006151
other_installment_plans (nom) 0.010507 0.003616 0.077902 0.060810 0.032278 0.048336 0.000796 0.008232 0.057177 0.006134 ... 0.055319 0.010896 0.047069 1.000000 0.016603 0.050290 0.008896 0.077224 0.000864 0.003132
housing (nom) 0.011197 0.012223 0.192174 0.011707 0.052427 0.201812 0.002926 0.035994 0.094890 0.054167 ... 0.307190 0.283880 0.307002 0.012313 1.000000 0.058105 0.019500 0.126136 0.008705 0.005728
credits_this_bank (con) 0.045732 0.097804 -0.011284 0.595094 0.146968 0.020795 0.074588 0.154743 0.021669 0.118680 ... 0.089625 0.018524 0.149254 0.050290 0.058105 1.000000 0.060502 0.109667 0.065553 0.009717
job (nom) 0.000946 0.008303 0.218688 0.006774 0.055931 0.334607 0.008119 0.095435 0.111352 0.009858 ... 0.035411 0.056728 0.164476 0.005317 0.015714 0.060502 1.000000 0.145956 0.098356 0.005135
people_under_maintenance (con) 0.003015 0.076944 -0.023834 0.097687 0.163750 0.017142 0.033914 0.097989 -0.071207 0.284250 ... 0.042643 0.094770 0.118201 0.077224 0.126136 0.109667 0.145956 1.000000 0.014753 0.077071
telephone (nom) 0.000990 0.004887 0.164718 0.003772 0.036672 0.276995 0.006340 0.016786 0.014413 0.005931 ... 0.095359 0.029368 0.145259 0.000750 0.010187 0.065553 0.142838 0.014753 1.000000 0.009860
foreign_worker (nom) 0.025499 0.020364 0.138196 0.025721 0.083834 0.050050 0.005530 0.028978 0.090024 0.014621 ... 0.054097 0.070038 0.006151 0.011586 0.028568 0.009717 0.031781 0.077071 0.042023 1.000000

21 rows × 21 columns

In [11]:
# We will ignore very weak correlations
#0.00-0.19: very weak
#0.20-0.39: weak
#0.40-0.59: moderate 
#0.60-0.79: strong
#0.80-1.00: very strong.

corr_triu = corr_df.where(~np.tril(np.ones(corr_df.shape)).astype(np.bool))
corr_triu = corr_triu.stack()
corr_triu[corr_triu > 0.19]
Out[11]:
default (nom)              duration_in_month (con)           0.214927
duration_in_month (con)    credit_history (nom)              0.194654
                           purpose (nom)                     0.273692
                           credit_amount (con)               0.624984
                           property (nom)                    0.304274
                           housing (nom)                     0.192174
                           job (nom)                         0.218688
credit_history (nom)       credit_amount (con)               0.193283
                           credits_this_bank (con)           0.595094
purpose (nom)              credit_amount (con)               0.370954
credit_amount (con)        property (nom)                    0.318339
                           housing (nom)                     0.201812
                           job (nom)                         0.334607
                           telephone (nom)                   0.276995
present_emp_since (nom)    present_res_since (con)           0.325431
                           age (con)                         0.409607
personal_status_sex (nom)  age (con)                         0.245809
                           people_under_maintenance (con)    0.284250
present_res_since (con)    property (nom)                    0.191575
                           age (con)                         0.266419
                           housing (nom)                     0.307190
property (nom)             age (con)                         0.224743
age (con)                  housing (nom)                     0.307002
dtype: float64
In [12]:
origCreditDf.columns
Out[12]:
Index(['default', 'account_check_status', 'duration_in_month',
       'credit_history', 'purpose', 'credit_amount', 'savings',
       'present_emp_since', 'installment_as_income_perc',
       'personal_status_sex', 'other_debtors', 'present_res_since', 'property',
       'age', 'other_installment_plans', 'housing', 'credits_this_bank', 'job',
       'people_under_maintenance', 'telephone', 'foreign_worker'],
      dtype='object')
In [13]:
#Let's drop irrelevant columns (columns which were part of very weak corelations)
selColumns_CreditDf=origCreditDf.copy()
selColumns_CreditDf=selColumns_CreditDf.drop(["account_check_status","duration_in_month","savings","installment_as_income_perc","other_debtors",
              "other_installment_plans","telephone","foreign_worker"],axis=1)
print(selColumns_CreditDf.info())
print(selColumns_CreditDf.columns)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
default                     1000 non-null int64
credit_history              1000 non-null object
purpose                     1000 non-null object
credit_amount               1000 non-null int64
present_emp_since           1000 non-null object
personal_status_sex         1000 non-null object
present_res_since           1000 non-null int64
property                    1000 non-null object
age                         1000 non-null int64
housing                     1000 non-null object
credits_this_bank           1000 non-null int64
job                         1000 non-null object
people_under_maintenance    1000 non-null int64
dtypes: int64(6), object(7)
memory usage: 101.6+ KB
None
Index(['default', 'credit_history', 'purpose', 'credit_amount',
       'present_emp_since', 'personal_status_sex', 'present_res_since',
       'property', 'age', 'housing', 'credits_this_bank', 'job',
       'people_under_maintenance'],
      dtype='object')
In [14]:
#we will look into all the boxplot individually to trace out outliers
ax = sns.boxplot(data=selColumns_CreditDf, orient="h")
In [15]:
# Boxplots show presence of outliers as whiskers can be seen. We will treat outlier by using Inter quantile range.
# Let's normalize colmns for age and credit amount using boxcox
from scipy import stats
selColumns_CreditDf['age']= stats.boxcox(selColumns_CreditDf['age'])[0].astype(int)
selColumns_CreditDf['credit_amount']=stats.boxcox(selColumns_CreditDf['credit_amount'])[0].astype(int)
ax = sns.boxplot(data=selColumns_CreditDf, orient="h")
In [16]:
#encoding the categorical variables
encoded_creditdf=pd.get_dummies(selColumns_CreditDf, columns=['credit_history','purpose',
       'present_emp_since','personal_status_sex','property','credits_this_bank', 'housing','job'])
print(encoded_creditdf.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 44 columns):
default                                                                       1000 non-null int64
credit_amount                                                                 1000 non-null int32
present_res_since                                                             1000 non-null int64
age                                                                           1000 non-null int32
people_under_maintenance                                                      1000 non-null int64
credit_history_all credits at this bank paid back duly                        1000 non-null uint8
credit_history_critical account/ other credits existing (not at this bank)    1000 non-null uint8
credit_history_delay in paying off in the past                                1000 non-null uint8
credit_history_existing credits paid back duly till now                       1000 non-null uint8
credit_history_no credits taken/ all credits paid back duly                   1000 non-null uint8
purpose_(vacation - does not exist?)                                          1000 non-null uint8
purpose_business                                                              1000 non-null uint8
purpose_car (new)                                                             1000 non-null uint8
purpose_car (used)                                                            1000 non-null uint8
purpose_domestic appliances                                                   1000 non-null uint8
purpose_education                                                             1000 non-null uint8
purpose_furniture/equipment                                                   1000 non-null uint8
purpose_radio/television                                                      1000 non-null uint8
purpose_repairs                                                               1000 non-null uint8
purpose_retraining                                                            1000 non-null uint8
present_emp_since_.. >= 7 years                                               1000 non-null uint8
present_emp_since_... < 1 year                                                1000 non-null uint8
present_emp_since_1 <= ... < 4 years                                          1000 non-null uint8
present_emp_since_4 <= ... < 7 years                                          1000 non-null uint8
present_emp_since_unemployed                                                  1000 non-null uint8
personal_status_sex_female : divorced/separated/married                       1000 non-null uint8
personal_status_sex_male : divorced/separated                                 1000 non-null uint8
personal_status_sex_male : married/widowed                                    1000 non-null uint8
personal_status_sex_male : single                                             1000 non-null uint8
property_if not A121 : building society savings agreement/ life insurance     1000 non-null uint8
property_if not A121/A122 : car or other, not in attribute 6                  1000 non-null uint8
property_real estate                                                          1000 non-null uint8
property_unknown / no property                                                1000 non-null uint8
credits_this_bank_1                                                           1000 non-null uint8
credits_this_bank_2                                                           1000 non-null uint8
credits_this_bank_3                                                           1000 non-null uint8
credits_this_bank_4                                                           1000 non-null uint8
housing_for free                                                              1000 non-null uint8
housing_own                                                                   1000 non-null uint8
housing_rent                                                                  1000 non-null uint8
job_management/ self-employed/ highly qualified employee/ officer             1000 non-null uint8
job_skilled employee / official                                               1000 non-null uint8
job_unemployed/ unskilled - non-resident                                      1000 non-null uint8
job_unskilled - resident                                                      1000 non-null uint8
dtypes: int32(2), int64(3), uint8(39)
memory usage: 69.4 KB
None
In [17]:
# Split Train/Test data 70:30 ratio
from sklearn.model_selection import train_test_split
#separating target column
y = encoded_creditdf['default']
#removing target column from features
X = encoded_creditdf.loc[:, encoded_creditdf.columns != 'default']
#70:30 train test division
X_train, X_test, y_train, y_test = train_test_split( encoded_creditdf, y, test_size=0.3, random_state=42,)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(700, 44) (300, 44) (700,) (300,)
In [18]:
# Randomforest Model without parameter tuning
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=28)

from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())

rf=rf.fit(X_train, y_train)
preds = rf.predict_proba(X_test)[:,1]
y_pred=rf.predict(X_test)
Parameters currently in use:

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': 28,
 'verbose': 0,
 'warm_start': False}
C:\Users\phlegmatic\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [19]:
#calculate Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

def calculate_confusion_matrix(y_true, y_pred):
    cm=confusion_matrix(y_true, y_pred)
    print(cm)
    
calculate_confusion_matrix(y_test, y_pred)
print(accuracy_score(y_test, y_pred))    
[[209   0]
 [  2  89]]
0.9933333333333333
In [20]:
# View a list of the features and their importance scores
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:15]
a = encoded_creditdf.columns[:]
features= a.drop('default',1)
#plot it
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
Out[20]:
Text(0.5, 0, 'Relative Importance')

As we can see credit amount, credit history delay, age are the important features determined by the model to classify the person profile.

In [21]:
trainResult = rf.score(X_train, y_train)
testResult = rf.score(X_test, y_test)
print("Train Accuracy:",(trainResult*100.0))
print("Test Accuracy:" ,(testResult*100.0))
Train Accuracy: 100.0
Test Accuracy: 99.33333333333333
In [22]:
#Hyper Parameter tuning
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)
{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
In [23]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   20.6s finished
Out[23]:
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None,
                                                    oob_sc...
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring=None, verbose=2)
In [24]:
#best parameters from fitting the random search:
rf_random.best_params_
Out[24]:
{'n_estimators': 200,
 'min_samples_split': 10,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 50,
 'bootstrap': True}
In [25]:
#Evaluate Random Search
#To determine if random search yielded a better model, we compare the base model with the best random search model.
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy
base_model = RandomForestClassifier(n_estimators = 10, random_state = 42)
base_model.fit(X_train, y_train)
base_accuracy = evaluate(base_model, X_test, y_test)
Model Performance
Average Error: 0.0167 degrees.
Accuracy = 94.51%.
In [26]:
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)
print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))
Model Performance
Average Error: 0.0000 degrees.
Accuracy = 100.00%.
Improvement of 5.81%.
In [27]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search using 3 folds
param_grid = {
    'bootstrap': [True],
    'max_depth': [40, 50, 60],
    'max_features': [2, 3],
    'min_samples_leaf': [2,3,4],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100,200]
}
# Create a based model
rf = RandomForestClassifier()
# Instantiate the grid search model with kfold where k=3
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
In [28]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_params_
Fitting 3 folds for each of 108 candidates, totalling 324 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   12.1s
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:   24.9s finished
C:\Users\phlegmatic\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
Out[28]:
{'bootstrap': True,
 'max_depth': 50,
 'max_features': 3,
 'min_samples_leaf': 2,
 'min_samples_split': 8,
 'n_estimators': 200}
In [29]:
best_grid = grid_search.best_estimator_
grid_accuracy = evaluate(best_grid, X_test, y_test)
print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy))
Model Performance
Average Error: 0.0033 degrees.
Accuracy = 98.90%.
Improvement of 4.65%.
In [30]:
num_folds = 5
seed = 28
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
kfold = KFold(n_splits=num_folds, random_state=seed)

results = cross_val_score(best_grid, X_train, y_train, cv=kfold)
print(results)
print("All column cross_val_score: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.99285714 0.98571429 1.         1.         0.99285714]
All column cross_val_score: 99.429% (0.535%)

Findings

Because we have normalized key column like age, credit_amount using boxcox method. Also the weak corelations between categorical columns allowed us to drop certain columns and make our model simpler.

Hyper parameter tuning using random search as well as GridSearchCV improved the accuracy of our model further. K-fold validation gives 99% accuracy to our model which gives us lot of confidence to classify the person profile as good credit or bad credit.